All of testingthewaters's Comments + Replies

I mean, this applies to humans too. The words and explanations we use for our actions are often just post hoc rationalisations. An efficient text predictor must learn not what the literal words in front of it mean, but the implied scenario and thought process they mask, and that is a strictly nonlinear and "unfaithful" process.

I think I've just figured out why decision theories strike me as utterly pointless: they get around the actual hard part of making a decision. In general, decisions are not hard because you are weighing payoffs, but because you are dealing with uncertainty.

To operationalise this: a decision theory usually assumes that you have some number of options, each with some defined payout. Assuming payouts are fixed, all decision theories simply advise you to pick the outcome with the highest utility. "Difficult problems" in decision theory are problems where the p... (read more)

cubefox
The theories typically assume that each choice option has a number of known mutually exclusive (and jointly exhaustive) possible outcomes. And to each outcome the agent assigns a utility and a probability. So uncertainty is in fact modelled insofar as the agent can assign subjective probabilities to those outcomes occurring. The expected utility of an outcome is then something like its probability times its utility. Other uncertainties are not covered in decision theory. E.g. 1) if you are uncertain what outcomes are possible in the first place, 2) if you are uncertain what utility to assign to a possible outcome, 3) if you are uncertain what probability to assign to a possible outcome. I assume you are talking about some of the latter uncertainties?
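To make the standard picture above concrete, here is a minimal sketch of an expected-utility calculation in Python. The options, outcomes, probabilities, and utilities are invented purely for illustration:

```python
# Minimal sketch of the expected-utility picture described above.
# All numbers are made up for illustration.

options = {
    "take umbrella": [
        # (probability, utility) for each mutually exclusive outcome
        (0.3, 5),    # it rains and you stay dry
        (0.7, 8),    # it doesn't rain; minor hassle of carrying it
    ],
    "leave umbrella": [
        (0.3, -10),  # it rains and you get soaked
        (0.7, 10),   # it doesn't rain; unencumbered
    ],
}

def expected_utility(outcomes):
    """Sum of probability * utility over one option's outcomes."""
    return sum(p * u for p, u in outcomes)

for name, outcomes in options.items():
    print(f"{name}: EU = {expected_utility(outcomes):.2f}")

best = max(options, key=lambda name: expected_utility(options[name]))
print("Pick:", best)
```

Note that all of the hard uncertainties listed above (which outcomes exist, which utilities and probabilities to assign) live in how the `options` table gets filled in; the calculation itself takes that table for granted.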
mako yass
I think unpacking that kind of feeling is valuable, but yeah it seems like you've been assuming we use decision theory to make decisions, when we actually use it as an upper bound model to derive principles of decisionmaking that may be more specific to human decisionmaking, or to anticipate the behavior of idealized agents, or (the distinction between CDT and FDT) as an allegory for toxic consequentialism in humans.

This has shifted my perceptions of what is in the wild significantly. Thanks for the heads up.

https://research.google/blog/deciphering-language-processing-in-the-human-brain-through-llm-representations/

Activations in LLMs are linearly mappable to activations in the human brain. Imo this is strong evidence for the idea that LLMs/NNs in general acquire extremely human-like cognitive patterns, and that the common "shoggoth with a smiley face" meme might just not be accurate.
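For a sense of what "linearly mappable" means operationally, here is a rough sketch, not the data or method from the linked post: fit a ridge regression from LLM activations to brain-recording features and measure how well it predicts held-out data. The arrays below are random placeholders.

```python
# Sketch of a linear-mapping analysis between LLM activations and brain recordings.
# Placeholder random data; illustrative only.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_words, llm_dim, brain_dim = 2000, 256, 64

# One LLM activation vector per word (placeholder).
llm_acts = rng.normal(size=(n_words, llm_dim))
# Toy "brain" signal: a noisy linear function of the LLM activations.
brain_acts = llm_acts @ rng.normal(size=(llm_dim, brain_dim)) * 0.1 + rng.normal(size=(n_words, brain_dim))

X_train, X_test, y_train, y_test = train_test_split(llm_acts, brain_acts, random_state=0)
mapping = Ridge(alpha=10.0).fit(X_train, y_train)

pred = mapping.predict(X_test)
corr = np.mean([np.corrcoef(pred[:, i], y_test[:, i])[0, 1] for i in range(brain_dim)])
print(f"mean held-out correlation: {corr:.2f}")  # higher = more "linearly mappable"
```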

That surprisingly straight line reminds me of what happens when you use noise to regularise an otherwise decidedly nonlinear function: https://www.imaginary.org/snapshot/randomness-is-natural-an-introduction-to-regularisation-by-noise
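A toy illustration of the regularisation-by-noise idea, separate from the linked article's construction: averaging a jagged nonlinear function over random perturbations of its input yields a much smoother, nearly linear curve.

```python
# Averaging a nonlinear function over input noise smooths it out.
# Purely illustrative numbers.
import numpy as np

def f(x):
    return x + 0.5 * np.sin(5 * x) + 0.3 * np.sign(np.sin(13 * x))  # decidedly nonlinear

xs = np.linspace(0, 4, 200)
noise = np.random.default_rng(0).normal(scale=0.5, size=(5000, 1))

raw = f(xs)
smoothed = f(xs + noise).mean(axis=0)   # average over noisy evaluations of the same inputs

# The smoothed curve hugs the straight line y = x far more closely than the raw one.
print("max deviation from y=x, raw:     ", np.max(np.abs(raw - xs)).round(2))
print("max deviation from y=x, smoothed:", np.max(np.abs(smoothed - xs)).round(2))
```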

I think this is a really cool research agenda. I can also try to give my "skydiver's perspective from 3000 miles in the air" overview of what I think expected free energy minimisation means, though I am by no means an expert. Epistemic status: this is a broad extrapolation of some intuitions I gained from reading a lot of papers; it may be very wrong.

In general, I think of free energy minimisation as a class of solutions for the problem of predicting complex systems behaviour, in line with other variational principles in physics. Thus, it is an attempt to ... (read more)

It's hard to empathise with dry numbers, whereas a lively scenario creates an emotional response, so more people engage. But I agree that this seems to be very well done statistical work.

Hey, thank you for taking the time to reply honestly and in detail as well. With regards to what you want, I think that this is in many senses also what I am looking for, especially the last item about tying in collective behaviour to reasoning about intelligence. I think one of the frames you might find the most useful is one you've already covered---power as a coordination game. As you alluded to in your original post, people aren't in a massive hive mind/conspiracy---they mostly want to do what other successful people seem to be doing, which translates ... (read more)

Garrett Baker
At least reading the Wikipedia article, this... does not seem so self-conscious to me. E.g. [quoted excerpts omitted] These are not exactly hard-hitting or at all novel or even interesting criticisms. And they're not even criticisms of the humanities! So how can it be self-conscious?

Hey, really enjoyed your triple review on power lies trembling, but imo this topic has been... done to death in the humanities, and reinventing terminology ad hoc is somewhat missing the point. The idea that the dominant class in a society comes from a set of social institutions that share core ideas and modus operandi (in other words "behaving as a single organisation") is not a shocking new phenomenon of twentieth century mass culture, and is certainly not a "mystery". This is basically how every country has developed a ruling class/ideology since the te... (read more)

Thanks for the well-written and good-faith reply. I feel a bit confused by how to relate to it on a meta level, so let me think out loud for a while.

I'm not surprised that I'm reinventing a bunch of ideas from the humanities, given that I don't have much of a humanities background and didn't dig very far through the literature.

But I have some sense that even if I had dug for these humanities concepts, they wouldn't give me what I want.

What do I want?

  1. Concepts that are applied to explaining current cultural and political phenomena around me (because those ar
... (read more)

Yeah, I'm not gonna do anything silly (I'm not in a position to do anything silly with regards to the multitrillion param frontier models anyways). Just sort of "laying the groundwork" for when AIs will cross that line, which I don't think is too far off now. The movie "Her" is giving a good vibe-alignment for when the line will be crossed.

Ahh, I was slightly confused why you called it a proposal. TBH I'm not sure why only 0.1% instead of any arbitrary percentage between (0, 100]. Otherwise it makes good logical sense.

Daniel Kokotajlo
Thanks! Lower percentages mean it gets in the way of regular business less, and thus is more likely to be actually adopted by companies.

Hey, the proposal makes sense from an argument standpoint. I would refine it slightly and phrase it as "the set of cognitive computations that generate role-emulating behaviour in a given context also generate qualia associated with that role" (sociopathy is the obvious counterargument here, and I'm really not sure what I think about the proposal of AIs as sociopathic by default). Thus, actors getting into character feel as if they are somehow sharing that character's emotions.

I take the two problems a bit further, and would suggest that being humane to AIs may ... (read more)

Daniel Kokotajlo
Oops, that was the wrong link, sorry! Here's the right link.

Hey Daniel, thank you for the thoughtful comment. I always appreciate comments that make me engage further with my thinking: I tend to get impatient with whatever post I'm writing and "rush it out of the door", so to speak, so this gives me another chance to reflect on my thoughts.

I think that there are approximately three defensible positions with regards to AI sentience, especially now that AIs seem to be demonstrating pretty advanced reasoning and human-like behaviour. One is the semi-mystical argument that humans/brains/embod... (read more)

Daniel Kokotajlo
Yeah I mean I think I basically agree with the above. As hopefully the proposal I linked makes clear? I'm curious what you think of it.

I think the problem of "Don't lose control of the AIs and have them take over and make things terrible (by human standards)" and the problem of "Don't be inhumane to the AIs" are distinct but related problems and we should be working on both. We should be aiming to understand how AIs work, on the inside, and how different training environments shape that--so we can build/train them in such a way that (a) they are having a good time, (b) they share our values & want what we want, and (c) are otherwise worthy of becoming equal and eventually superior partners in the relationship (analogous to how we raise children, knowing that eventually they'll be taking care of us). And then we want to actually treat them with the respect they deserve. And moreover insofar as they end up with different values, we should still treat them with the respect they deserve, e.g. we should give them a "cooperate and be treated well" option instead of just "slavery vs. rebellion."

To be clear I'm here mostly thinking about future more capable and autonomous and agentic AI systems, not necessarily today's systems. But I think it's good to get started asap because these things take time + because of uncertainty. Have you heard of Eleos? You might want to get in touch.

This seems like an interesting paper: https://arxiv.org/pdf/2502.19798

Essentially: use developmental psychology techniques to cause LLMs to develop a more well-rounded, human-friendly persona that involves reflecting on their actions, while gradually escalating the moral difficulty of the dilemmas presented as a kind of phased training. I see it as a sort of cross between RLHF, CoT, and the recent work on low-example-count fine-tuning, but for moral instead of mathematical intuitions.

Yeah, that's basically the conclusion I came to a while ago. Either it loves us or we're toast. I call it universal love or pathos.

Davey Morse
I appreciate your conclusion and in particular its inner link to "Self-Other Overlap." Though I do think we have a window of agency: to intervene in self-interested proto-SI to reduce the chance that it adopts short-term greedy thinking that makes us toast.

This seems like very important and neglected work; I hope you get the funds to continue.

Mordechai Rorvig
Thank you!

Yeah, definitely. My main gripe where I see people disregarding unknown unknowns is a similar one to yours: people who present definite, worked-out pictures of the future.

Note to self: If you think you know where your unknown unknowns sit in your ontology, you don't. That's what makes them unknown unknowns.

If you think that you have a complete picture of some system, you can still find yourself surprised by unknown unknowns. That's what makes them unknown unknowns.

If your internal logic has almost complete predictive power, plus or minus a tiny bit of error, your logical system (but mostly not your observations) can still be completely overthrown by unknown unknowns. That's what makes them unknown unknowns.

You can respect u... (read more)

CapResearcher
I could feel myself instinctively disliking this argument, and I think I figured out why. Even though the argument is obviously true, and it is here used to argue for something I agree with, I've historically mostly seen this argument used to argue against things I agree with: specifically, to disregard experts, and to insist that nuclear power should never be built, no matter how safe it looks.

Now this explains my gut reaction, but not whether it's a good argument. When thinking through it, my real problem with the argument is the following. While it's technically true, it doesn't help locate any useful crux or resolution to a disagreement. Essentially, it naturally leads to a situation where one party estimates the unknown unknowns to be much larger than the other party does, and this becomes the crux. To make things worse, often one party doesn't want to argue for their estimate of the size of the unknown unknowns. But we need to estimate the sizes of unknown unknowns, otherwise I can troll people with "tic-tac-toe will never be solved because of unknown unknowns".

I therefore feel better about arguments for why unknown unknowns may be large, compared to just arguing for a positive probability of unknown unknowns. For example, society has historically been extremely chaotic when viewed at large time scales, and we have numerous examples of similar past predictions which failed because of unknown unknowns. So I have a tiny prior probability that anyone can accurately predict what society will look like far into the future.

The problem here is that you are dealing with survival necessities rather than trade goods. The outcome of this trade, if both sides honour the agreement, is that the scope-insensitive humans die and their society is extinguished. The analogous situation here is that you know there will be a drought in, say, 10 years. The people of the nearby village are "scope insensitive": they don't know the drought is coming. Clearly the moral thing to do if you place any value on their lives is to talk to them, clear the information gap, and share access to resources. F... (read more)

Cleo Nardo
Ah, your reaction makes more sense given you think this is the proposal. But it's not the proposal. The proposal is that the scope-insensitive values flourish on Earth, and the scope-sensitive values flourish in the remaining cosmos.

As a toy example, imagine a distant planet with two species of alien: paperclip-maximisers and teacup-protectors. If you offer a lottery to the paperclip-maximisers, they will choose the lottery with the highest expected number of paperclips. If you offer a lottery to the teacup-protectors, they will choose the lottery with the highest chance of preserving their holy relic, which is a particular teacup. The paperclip-maximisers and the teacup-protectors both own property on the planet. They negotiate the following deal: the paperclip-maximisers will colonise the cosmos, but leave the teacup-protectors a small sphere around their home planet (e.g. 100 light-years across). Moreover, the paperclip-maximisers promise not to do anything that risks their teacup, e.g. choosing a lottery that doubles the size of the universe with 60% chance and destroys the universe with 40% chance.

Do you have intuitions that the paperclip-maximisers are exploiting the teacup-protectors in this deal? Do you think instead that the paperclip-maximisers should fill the universe with half paperclips and half teacups?

I think this scenario is a better analogy than the scenario with the drought. In the drought scenario, there is an objective fact which the nearby villagers are ignorant of, and they would act differently if they knew this fact. But I don't think scope-sensitivity is a fact like "there will be a drought in 10 years". Rather, scope-sensitivity is a property of a utility function (or a value system, more generally).
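A minimal sketch of the toy example above, with invented numbers: the same pair of lotteries gets ranked differently by an expected-paperclip maximiser and by an agent that only cares about the probability that its one teacup survives.

```python
# Scope-sensitive vs scope-insensitive value systems ranking the same lotteries.
# Illustrative numbers only.

# Each lottery: list of (probability, paperclips_gained, teacup_survives)
lotteries = {
    "safe":   [(1.0, 1_000, True)],
    "gamble": [(0.6, 1_000_000, True),   # universe doubles: lots of paperclips
               (0.4, 0,         False)], # universe destroyed: no paperclips, no teacup
}

def expected_paperclips(lottery):
    return sum(p * clips for p, clips, _ in lottery)

def p_teacup_survives(lottery):
    return sum(p for p, _, survives in lottery if survives)

print("paperclip-maximiser picks:",
      max(lotteries, key=lambda k: expected_paperclips(lotteries[k])))
print("teacup-protector picks:  ",
      max(lotteries, key=lambda k: p_teacup_survives(lotteries[k])))
# -> the maximiser takes the gamble; the protector refuses it.
```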

Except that's a false dichotomy (between spending energy to "uplift" them and dealing treacherously with them). All it takes to not be a monster who obtains a stranglehold over all the watering holes in the desert is a sense of ethics that holds you to the reasonably low bar of "don't be a monster". The scope sensitivity or lack thereof of the other party is in some sense irrelevant.

Cleo Nardo
If you think you have a clean resolution to the problem, please spell it out more explicitly. We're talking about a situation where a scope-insensitive value system and a scope-sensitive value system make a free trade in which both sides gain by their own lights. Can you spell out why you classify this as treachery? What is the key property that this shares with more paradigmatic examples of treachery (e.g. stealing, lying, etc)?
Noosphere89
From whose perspective, exactly?

The question as stated can be rephrased as "Should EAs establish a strategic stranglehold over all future resources necessary to sustain life using a series of unequal treaties, since other humans will be too short-sighted/insensitive to scope/ignorant to realise the importance of these resources in the present day?"

And people here wonder why these other humans see EAs as power hungry.

Cleo Nardo
I mention this in (3). I used to think that there was some idealisation process P such that we should treat agent A in the way that P(A) would endorse, but see On the limits of idealized values by Joseph Carlsmith. I'm increasingly sympathetic to the view that we should treat agent A in the way that A actually endorses.

Hey, thanks for the reply. I think this is a very valuable response because there are certain things I would want to point out that I can now elucidate more clearly thanks to your pushback.

First, I don't suggest that if we all just laughed and went about our lives everything would be okay. Indeed, if I thought that our actions were counterproductive at best, I'd advocate for something more akin to "walking away" as in Valentine's exit. There is a lot of work to be done and (yes) very little time to do it.

Second, the pattern I am noticing is something more... (read more)

Do not go gentle into that good night,

Old age should burn and rave at close of day;

Rage, rage against the dying of the light.

Though wise men at their end know dark is right,

Because their words had forked no lightning they

Do not go gentle into that good night.

Good men, the last wave by, crying how bright

Their frail deeds might have danced in a green bay,

Rage, rage against the dying of the light.

Wild men who caught and sang the sun in flight,

And learn, too late, they grieved it on its way,

Do not go gentle into that good night.

Grave men, near death, who see w... (read more)

niplav

Because their words had forked no lightning they

I think we have the opposite problem: our words are about to fork all the lightning.

Ben Pace

Thank you.

It does not currently look to me like we will win this war, speaking figuratively. But regardless, I still have many opportunities to bring truth, courage, justice, honor, love, playfulness, and other virtues into the world, and I am a person whose motivations run more on living out virtues rather than moving toward concrete hopes. I will still be here building things I love, like LessWrong and Lighthaven, until the end.

In my book this counts as severely neglected and very tractable AI safety research. Sorry that I don't have more to add, but it felt important to point this out.

Even so, it seems obvious to me that addressing the mysterious issue of the accelerating drivers is the primary crux in this scenario.

romeostevensit
Yes, and I don't mean to overstate a case for helplessness. Demons love convincing people that the anti-demon button doesn't work, so that they never press it even though it is sitting right out in the open.

Epistemic status: This is a work of satire. I mean it---it is a mean-spirited and unfair assessment of the situation. It is also how, some days, I sincerely feel.

A minivan is driving down a mountain road, headed towards a cliff's edge with no guardrails. The driver floors the accelerator.

Passenger 1: "Perhaps we should slow down somewhat."

Passengers 2, 3, 4: "Yeah, that seems sensible."

Driver: "No can do. We're about to be late to the wedding."

Passenger 2: "Since the driver won't slow down, I should work on building rocket boosters so that (when we inevita... (read more)

Unfortunately, the disanalogy is that any driver who moves their foot towards the brakes is almost instantly replaced with one who won't.

Driver: My map doesn't show any cliffs

Passenger 1: Have you turned on the terrain map? Mine shows a sharp turn next to a steep drop coming up in about a mile

Passenger 5: Guys maybe we should look out the windshield instead of down at our maps?

Driver: No, passenger 1, see on your map that's an alternate route, the route we're on doesn't show any cliffs.

Passenger 1: You don't have it set to show terrain.

Passenger 6: I'm on the phone with the governor now, we're talking about what it would take to set a 5 mile per hour national speed limit.

Passenger 7: Don't ... (read more)

This is imo quite epistemically important.

It's definitely something I hadn't read before, so thank you. I would say (on a skim) that the article has clarified my thinking somewhat. I therefore question the law/toolbox dichotomy, since to me it seems that usefulness and accuracy-to-perceived-reality are in fact two different axes. Thus you could imagine:

  • A useful-and-inaccurate belief (e.g. what we call old wives' tales, "red sky in morning, sailors take warning", herbal remedies that have medical properties but not because of what the "theory" dictates)
  • A not-useful-but-accurate belief (wh
... (read more)
brendan.furneaux
I'm not sure I understand how "red sky in morning, sailors take warning" can be both inaccurate and useful. Surely a heuristic for when to prepare for bad weather is useful only insofar as it is accurate?

Hey, thanks for responding! Re the physics analogy, I agree that improvements in our heuristics are a good thing:

However, perhaps you have already begun to anticipate what I will say—the benefit of heuristics is that they acknowledge (and are indeed dependent on) the presence of context. Unlike a "hard" theory, which must be applicable to all cases equally and fails in the event a single counter-example can be found, a "soft" heuristic is triggered only when the conditions are right: we do not use our "judge popular songs" heuristic when staring at a dinne

... (read more)
Ben Pace
Relatedly, in case you haven't seen it, I'll point you to an essay by Eliezer on this distinction: Toolbox-thinking and Law-thinking.

And as for the specific implications of "moral worth", here are a few:

  • You take someone's opinions more seriously
  • You treat them with more respect
  • When you disagree, you take time to outline why and take time to pre-emptively "check yourself"
  • When someone with higher moral worth is at risk you think this is a bigger problem, compared against the problem of a random person on earth being at risk

Thank you for the feedback! I am of course happy for people to copy over the essay.

> Is this saying that humans' goals and options (including options that come to mind) change depending on the environment, so rational choice theory doesn't apply?

More or less, yes, or at least that it becomes very hard to apply it in a way that isn't either highly subjective or essentially post-hoc arguing about what you ought to have done (hidden information/hindsight being 20/20)

> This is currently all I have time for; however, my current understanding is that there... (read more)

testingthewaters
And as for the specific implications of "moral worth", here are a few:

  • You take someone's opinions more seriously
  • You treat them with more respect
  • When you disagree, you take time to outline why and take time to pre-emptively "check yourself"
  • When someone with higher moral worth is at risk you think this is a bigger problem, compared against the problem of a random person on earth being at risk
Ben Pace
Done.