All of Wei Dai's Comments + Replies

No, because power/influence dynamics could be very different under CEV than in the current world, so it seems reasonable to distrust CEV in principle or in practice; and/or because CEV may be sensitive to initial conditions, implying a lot of leverage from influencing opinions before it starts.

Nina is worried not just about humans getting killed and replaced, but also about humans not being allowed to have unenhanced children. It seems plausible that most humans, after reflection, would endorse some kind of "successionist" philosophy/ideology, and decide that intentionally creating an unenhanced human constitutes a form of child abuse (e.g., due to the risk of a psychological tendency to suffer, or having a much worse life in expectation than what's possible). It seems reasonable for Nina to worry about this, if she thinks her own values (current or eventual or actual) are different.

3Mikhail Samin
(btw I expect we’ll really want enhanced humans to have the capacity to suffer, because we have preferences around future people being able to experience the kinds of feelings we experience when we read stories, including very sad stories. Some suffering is reflectively endorsed and we enjoy it/wouldn’t want it to not happen; and it seems fine to want new humans and enhanced current humans to also have it, although maybe with more access to some control over it.)
3Cole Wyeth
Certainly an aligned AI can be a serious threat if you have sufficiently unusual values relative to whoever does the aligning. That worries me a lot - I think many possible “positive” outcomes are still somewhat against my interests and are also undemocratic, stripping agency from many people.  However, if this essay were capable of convincing “humanity” that they shouldn’t value enhancement, CEV should already have that baked in?

The usual scaling laws are about IID samples from a fixed data distribution, so they don’t capture this kind of effect.

Doesn't this seem like a key flaw in the usual scaling laws? Why haven't I seen this discussed more? The OP did mention declining average data quality but didn't emphasize it much. This 2023 post trying to forecast AI timeline based on scaling laws did not mention the issue at all, and I received no response when I made this point in its comments section.

Even if it were true that the additional data literally “contained no new ide

... (read more)
7cubefox
LeCun has written about this. Humans are already pretrained on large amounts of sensory data before they learn language, while language models are trained from scratch with language. The current pretraining paradigm only works well with text, as this text data is relatively low-dimensional (e.g. 2^16 = 65,536 possible tokens for a 16-bit tokenizer), but not with audio or video, where the dimensionality explodes. Predicting a video frame is much harder than predicting a text token, as the former is orders of magnitude larger. In a blog post, LeCun says that humans or animals, when doing "predictive coding", predict mainly latent embeddings rather than precise sensory data. It's currently not clear how this can be done efficiently with machine learning.
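To make the dimensionality gap concrete, here is a rough back-of-the-envelope sketch; the 2^16 vocabulary is from the comment above, while the 256x256 RGB frame is an assumed, purely illustrative size:

```python
import math

# Rough back-of-the-envelope comparison (illustrative numbers): the size of a
# next-token prediction target vs. a raw video-frame prediction target.
vocab_size = 2 ** 16                       # assumed 16-bit tokenizer: 65,536 tokens
frame_h, frame_w, channels = 256, 256, 3   # assumed small RGB frame
bits_per_channel = 8

text_target_bits = int(math.log2(vocab_size))                        # one categorical draw
frame_target_bits = frame_h * frame_w * channels * bits_per_channel  # joint over all pixels

print(f"text token target:  {text_target_bits} bits")
print(f"video frame target: {frame_target_bits:,} bits")          # 1,572,864 bits
print(f"ratio: ~{frame_target_bits // text_target_bits:,}x")      # ~98,304x larger
```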
Wei Dai100

The power of scaling is that with real unique data, however unoriginal, the logarithmic progress doesn't falter, it still continues its logarithmic slog at an exponential expense rather than genuinely plateauing.

How to make sense of this? If the additional training data is mostly low quality (AI labs must have used the highest quality data first?) or repetitive (contains no new ideas/knowledge), perplexity might go down but what is the LLM really learning?

AI labs must have used the highest quality data first

The usual scaling laws are about IID samples from a fixed data distribution, so they don't capture this kind of effect.

But even with IID samples, we'd expect to get diminishing marginal returns, and we do.  And you're asking: why, then, do we keep getting returns indefinitely (even diminishing ones)?

I think the standard answer is that reality (and hence, the data) contains a huge number of individually rare "types of thing," following some long-tailed distribution.  So even when the LLM has see... (read more)
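A toy simulation of this long-tail picture (my own sketch, not from the comment): sample IID from a Zipf-like distribution over a million "types of thing" and count how many distinct types each sample size has covered.

```python
import numpy as np

# Toy model: "types of thing" follow a Zipf-like long tail; count how many
# distinct types an IID sample of size n has covered.
rng = np.random.default_rng(0)
num_types = 1_000_000
probs = 1.0 / np.arange(1, num_types + 1)   # Zipf(s=1)-shaped tail
probs /= probs.sum()

for n in [10**3, 10**4, 10**5, 10**6]:
    sample = rng.choice(num_types, size=n, p=probs)
    print(f"n={n:>9,}  distinct types seen: {len(np.unique(sample)):,}")

# Each 10x increase in data still surfaces new, rarer types, but fewer each
# time: returns keep coming, just at a diminishing (roughly logarithmic) rate.
```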

You realize that from my perspective, I can't take this at face value due to "many apparent people could be non‑conscious entities", right? (Sorry to potentially offend you, but it seems like too obvious an implication to pretend not to be aware of.) I personally am fairly content most of the time but do have memories of suffering. Assuming those memories are real, and your suffering is too, I'm still not sure that justifies calling the simulators "cruel". The price may well be worth paying, if it potentially helps to avert some greater disaster in the bas... (read more)

Wei DaiΩ460

If it’s possible at all for this process to lead somewhere good, then it’s possible for it to lead somewhere good within the mind of an AI that combines a human-like ability to reason, with human-like social and moral instincts / reflexes.

  1. A counterexample to this is if humans and AIs both tend to conclude after a lot of reflection that they should be axiologically selfish but decision theoretically cooperative (with other strong agents), then if we hand off power to AIs, they'll cooperate with each other (and any other powerful agents in the universe o
... (read more)
5Steven Byrnes
Thanks! RE 2 – I was referring here to (what I call) “brain-like AGI”, a yet-to-be-invented AI paradigm in which both “human-like ability to reason” and “human-like social and moral instincts / reflexes” are in a nuts-and-bolts sense, like they’re actually doing the same kinds of algorithmic steps that a human brain would do. Human brains are quite different from LLMs, even if their text outputs can look similar. For example, small groups of humans can invent grammatical languages from scratch, and of course historically humans invented science and tech and philosophy and so on from scratch. So bringing up “training data” is kinda the wrong idea. Indeed, for humans (and brain-like AGI), we should be talking about “training environments”, not “training data”—more like the RL agent paradigm of the late 2010s than like LLMs, at least in some ways. I do agree that we shouldn’t trust LLMs to make good philosophical progress that goes way beyond what’s already in their human-created training data. RE 1 – Let’s talk about feelings of friendship, compassion, and connection. These feelings are unnecessary for cooperation, right? Logical analysis of costs vs benefits of cooperation, including decision theory, reputational consequences, etc., are all you need for cooperation to happen. (See §2-3 here.) But for me and almost anyone, a future universe with no feelings of friendship, compassion, and connection in it seems like a bad thing that I don’t want to happen. I find it hard to believe that sufficient reflection would change my opinion on that [although I have some niggling concerns about technological progress]. “Selfishness” isn’t even a coherent concept unless the agent intrinsically wants something, and innate drives are upstream of what it wants, and those feelings of friendship, compassion etc. can be one of those innate drives, potentially a very strong one. Then, yes, there’s a further question about whom those feelings will be directed towards—AIs, humans, anim

I should have given some examples of my own. Here's Gemini on a story idea of mine, for the Star Wars universe ("I wish there was a story about a power-hungry villain who takes precautions against becoming irrational after gaining power. You'd think that at least some would learn from history. [...] The villain could anticipate that even his best efforts might fail, and create a mechanism to revive copies of himself from time to time, who would study his own past failures, rise to power again, and try to do better each time. [...] Sometimes the villain bec... (read more)

Is Gemini 2.5 Pro really not sycophantic? Because I tend to get more positive feedback from it than from any online or offline conversation with humans. (Alternatively, humans around me are too reluctant to give explicit praise?)

gwern*140

I think it's still sycophantic compared to hardcore STEM circles where we regard criticism as a bloodsport and failing to find fault in something as defeat. But it's much less so than the more relevant comparison, which is other LLMs, and in an absolute sense it's at a level where it's hard to distinguish from reasonable opinions and doesn't seem to be getting in the way too much. As davidad notes, it's still at a level where you can sense its reluctance or if it's shading things to be nice, and that is a level where it's just a small quirk and something y... (read more)

Why do you think they haven't talked to us?

They might be worried that their own philosophical approach is wrong but too attractive once discovered, or creates a blind spot that makes it impossible to spot the actually correct approach. The division of western philosophy into analytical and continental traditions, who are mutually unable to appreciate each other's work, seems to be an instance of this. They might think that letting other philosophical traditions independently run to their logical conclusions, and then conversing/debating, is one way to try to make real progress.

1ektimo
Perhaps in most of the simulations, they help by sharing what they've learned, giving brain enhancements, etc., but those ones quickly reach philosophical dead ends, so we find ourselves in one of the ones which doesn't get help and takes longer before reaching all the dead ends. (This seems more plausible to me than using the simulations for "mapping the spectrum of rival resource‑grabbers" since I think we're not smart enough to come up with novel ASIs that they haven't already seen or thought of.)

My sense is that most of the people with lots of power are not taking heroic responsibility for the world. I think that Amodei and Altman intend to achieve global power and influence but this is not the same as taking global responsibility. I think, especially for Altman, the desire for power comes first relative to responsibility. My (weak) impression is that Hassabis has less will-to-power than the others, and that Musk has historically been much closer to having responsibility be primary.

Can you expand on this? How can you tell the difference, and do... (read more)

4Ben Pace
But I don't believe most people get corrupted by power regardless of initial intentions? I don't think Francis Bacon was corrupted by power, I don't think James Watt was corrupted by power, I don't think Stanislav Petrov was corrupted by power, and all of these people had far greater influence over the world than most people who are "corrupted by power". I'm hearing you'd be interested in me saying more words about the difference in what it looks like to be motivated by responsibility versus power-seeking. I'll say some words, can see if they help.
* I think someone motivated by responsibility often will end up looking more aligned with their earlier self over time even as they grow and change, will often not accept opportunities for a lot of power/prestige/money because they're uninteresting to them, will often make sacrifices of power/prestige for ethical reasons, will pursue a problem they care about long after most would give up or think it likely to be solved.
* I think someone primarily seeking power will be much more willing to do things that pollute the commons or break credit-allocation mechanisms to get credit, and generally game a lot of systems that other people are earnestly rising through. They will more readily pivot on what issue they say they care about or are working on because they're not attached to the problem, but to the reward for solving the problem, and many rewards can be gotten from lots of different problems. They'll be more guided by what's fashionable right now, and more attuned to it. They'll maneuver themselves in order to be able to politically work with whoever has power that they want, regardless of the ethics/competence/corruption of those people.

Building an AGI doesn't seem to me like a very legible mechanism of power, or at least it didn't in the era Eliezer pursued it (where it wasn't also credibly "a path to making billions of dollars and getting incredible prestige"). The word 'legible' was doing a lot of work in the se

Simulating civilizations won't solve philosophy directly, but can be useful for doing so eventually by:

  1. Giving us more ideas about how to solve philosophy, by seeing how other civilizations try to do it.
  2. Pointing out potential blind spots / path dependencies in one's current approach.
  3. Directly solving certain problems (e.g., whether all sufficiently advanced civilizations converge to objective values or the same decision theory or notion of rationality).

Yeah, that seems a reasonable way to look at it. "Heroic responsibility" could be viewed as a kind of "unhobbling via prompt engineering", perhaps.

Wei DaiΩ122310

At the outermost feedback loop, capabilities can ultimately be grounded via relatively easy objective measures such as revenue from AI, or later, global chip and electricity production, but alignment can only be evaluated via potentially faulty human judgement. Also, as mentioned in the post, the capabilities trajectory is much harder to permanently derail because unlike alignment, one can always recover from failure and try again. I think this means there's an irreducible logical risk (i.e., the possibility that this statement is true as a matter of fact ... (read more)

8Joe Carlsmith
Yes I think I basically agree. That is, I think it's very possible that capabilities research is inherently easier to automate than alignment research; I am very worried about the least cautious actors pushing forward prematurely; as I tried to emphasize in the post, I think capability restraint is extremely important (and: important even if we can successfully automate alignment research); and I think that norms/laws are likely to play an important role there. 
Wei Dai4-2

Since bad people won’t heed your warning it doesn’t seem in good people’s interests to heed it either.

I'm not trying to "warn bad people". I think we have existing (even if imperfect) solutions to the problem of destructive values and biased beliefs, which "heroic responsibility" actively damages, so we should stop spreading that idea or even argue against it. See my reply to Ryan, which is also relevant here.

If humans can't easily overcome their biases or avoid having destructive values/beliefs, then it would make sense to limit the damage through norms and institutions (things like informed consent, boards, separation of powers and responsibilities between branches of government). Heroic responsibility seems antithetical to group-level solutions, because it implies that one should ignore norms like "respect the decisions of boards/judges" if needed to "get the job done", and reduces social pressure to follow such norms (by giving up the moral high ground from... (read more)

2ryan_greenblatt
Hmm, I'm not sure that the idea of heroic responsibility undermines these existing mechanisms for preventing these problems, partially because I'm skeptical these existing mechanisms make much of a difference in the relevant case.
Wei Dai*5112

Reassessing heroic responsibility, in light of subsequent events.

I think @cousin_it made a good point "if many people adopt heroic responsibility to their own values, then a handful of people with destructive values might screw up everyone else, because destroying is easier than helping people" and I would generalize it to people with biased beliefs (which is often downstream of a kind of value difference, i.e., selfish genes).

It seems to me that "heroic responsibility" (or something equivalent but not causally downstream of Eliezer's writings) is contribu... (read more)

4Mitchell_Porter
Is this analogous to saying "capabilities research is dangerous and should not be pursued", but for the human psyche rather than for AI?
Ben Pace*5120

My sense is that most of the people with lots of power are not taking heroic responsibility for the world. I think that Amodei and Altman intend to achieve global power and influence but this is not the same as taking global responsibility. I think, especially for Altman, the desire for power comes first relative to responsibility. My (weak) impression is that Hassabis has less will-to-power than the others, and that Musk has historically been much closer to having responsibility be primary.

I don’t really understand this post as doing something other than ... (read more)

4Max H
I kind of doubt that leaders at big labs would self-identify as being motivated by anything like Eliezer's notion of heroic responsibility. If any do self-identify that way though, they're either doing it wrong or misunderstanding. Eliezer has written tons of stuff about the need to respect deontology and also think about all of the actual consequences of your actions, even (especially) when the stakes are high: (https://glowfic.com/replies/1874768#reply-1874768) (https://glowfic.com/replies/1940939#reply-1940939)

Starting an AI lab in order to join a doomed race to superintelligence, and then engaging in a bunch of mundane squabbles for corporate control, seems like exactly the opposite of the sentiment here: (https://hpmor.com/chapter/93)

Also, re this example: In general, it seems perfectly fine and normal for a founder-CEO to fight back against a board ouster - no need to bring heroic responsibility into it. Of course, all parties including the CEO and the board should stick to legal / above-board / ethical means of "fighting back", but if there's a genuine disagreement between the board and the CEO on how to best serve shareholder interests (or humanity's interests, for a non-profit), why wouldn't both sides vigorously defend their own positions and power?

Perhaps the intended reading of your example is that heroic responsibility would obligate or justify underhanded tactics to win control, when the dispute has existential consequences. But I think that's a misunderstanding of the actual concept. Ordinary self-confidence and agency obligate you to defend your own interests / beliefs / power, and heroic responsibility says that you're obligated to win without stepping outside the bounds of deontology or slipping into invalid / motivated reasoning.
2testingthewaters
Ah yes, but if all these wannabe heroes keep going we'll be really screwed, so it's up to me to take a stand against the fools dooming us all... the ratchet of Moloch cranks ever clockwise
5cubefox
Can this be summarized as "don't optimize for what you believe is good too hard, as you might be mistaken about what is good"?
4Garrett Baker
This argument seems only convincing if you don’t have those destructive values. One man’s destructive values is another’s low-hanging fruit, and those who see low hanging fruit everywhere won’t give up on the fruit just because others may pick it. Since bad people won’t heed your warning it doesn’t seem in good people’s interests to heed it either. An analogy is one can make the same argument wrt rationality itself. Its dual use! Someone with bad values can use rationality to do a lot of harm! Does that mean good people shouldn’t use rationality? No!

I'm also uncertain about the value of "heroic responsibility", but this downside consideration can be mostly addressed by "don't do things which are highly negative sum from the perspective of some notable group" (or other anti-unilateralist curse type intuitions). Perhaps this is too subtle in practice.

But as you suggested in the post, the apparently vast amount of suffering isn't necessarily real? "most cosmic details and human history are probably fake, and many apparent people could be non‑conscious entities"

(However I take the point that doing such simulations can be risky or problematic, e.g., if one's current ideas about consciousness are wrong, or if doing philosophy correctly requires having experienced real suffering.)

8James_Miller
I'm in low-level chronic pain, including as I write this comment, so while I think the entire Andromeda galaxy might be fake, I think at least some suffering must be real, or at least I have the same confidence in my suffering as I do in my consciousness.
Wei Dai8-2

My alternative hypothesis is that we're being simulated by a civilization trying to solve philosophy, because they want to see how other civilizations might approach the problem of solving philosophy.

2Mitchell_Porter
How often will a civilization with the capability to perform such a simulation, have anything to learn from it? 
2Seth Herd
How is simulating civilizations going to solve philosophy?
1ektimo
Why do you think they haven't talked to us? Creating zillions of universes doing bad philosophy (or at least presumably worse than they could do if the simulators shared their knowledge) doesn't seem like a good way to try to solve philosophy. Even if they prefer to wait and narrow down a brute force search to ASIs that the surviving civilizations create (like in jaan's video), it seems like it would be worth not keeping us in the dark so that we don't just create ASIs like they've already seen before from similarly less informed civilizations.
4James_Miller
If your hypothesis is true, that's a cruel civilization by my personal standards because of all the suffering in this world.

Did anyone predict that we’d see major AI companies not infrequently releasing blatantly misaligned AIs (like Sydney, Claude 3.7, o3)?

Just four days later, X blew up with talk of how GPT-4o has become sickeningly sycophantic in recent days, followed by an admission from Sam Altman that something went wrong (with lots of hilarious examples in replies):

the last couple of GPT-4o updates have made the personality too sycophant-y and annoying (even though there are some very good parts of it), and we are working on fixes asap, some today and some this week

... (read more)

I initially tried to use Gemini 2.5 Pro to write the whole explanation, but it kept making one mistake after another in its economics reasoning. Each rewrite would contain a new mistake after I pointed out the last one, or it would introduce a new mistake when I asked for some other kind of change. After pointing out 8 mistakes like this, I finally gave up and wrote it myself. I also tried Grok 3 and Claude 3.7 Sonnet but gave up more quickly on them after the initial responses didn't look promising. However AI still helped a bit by reminding me of the rig... (read more)

Wei Dai122

 In a competitive market, companies pay wages equal to Value of Marginal Product of Labor (VMPL) = P * MPL (Price of marginal output * Marginal Product per hour). (In programming, each output is like a new feature or bug fix, which don't have prices attached, so P here is actually more like the perceived/estimated value (impact on company revenue or cost) of the output.)

When AI increases MPL, it can paradoxically decrease VMPL by decreasing P more, even if there are no new entrants in the programming labor market. This is because each company has a li... (read more)
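A toy numerical sketch of this mechanism (my own illustration; the demand curve and all numbers are made up): the wage equals P(Q) × MPL, and P falls as the firm works through its highest-value projects, so a large enough rise in MPL can still leave VMPL lower.

```python
# Toy sketch: a firm does its highest-value projects first, so the perceived
# value P of the marginal unit of programming output falls as total output rises.

def marginal_value(q):
    """Perceived value of the q-th unit of output (assumed to fall faster than 1/q)."""
    return 200 / (1 + 0.1 * q) ** 2

def vmpl(mpl, hours):
    """Value of the last hour's marginal product: units produced in that hour
    times the marginal value of output at that point (the competitive wage)."""
    q_before = mpl * (hours - 1)
    return mpl * marginal_value(q_before)

hours = 40  # one programmer-week
for label, mpl in [("without AI", 1.0), ("with AI", 3.0)]:
    print(f"{label}: MPL = {mpl}, VMPL of last hour = {vmpl(mpl, hours):.2f}")

# With MPL tripled, the firm is much further down its internal demand curve,
# so P falls by more than MPL rises and the competitive wage (= VMPL) drops.
```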

2Viliam
This makes perfect sense when you put it this way, and yet I imagine that if I tried to make a similar argument on the internet, I would immediately get a "Lump of labour fallacy" reply. (I guess the problem is with words such as "short term". Are we talking weeks, months, years? In a relatively static economy, or approaching singularity? Basically the speed of discovering new high-value work vs the speed such work becomes obsolete.)
8Wei Dai
I initially tried to use Gemini 2.5 Pro to write the whole explanation, but it kept making one mistake after another in its economics reasoning. Each rewrite would contain a new mistake after I pointed out the last one, or it would introduce a new mistake when I asked for some other kind of change. After pointing out 8 mistakes like this, I finally gave up and wrote it myself. I also tried Grok 3 and Claude 3.7 Sonnet but gave up more quickly on them after the initial responses didn't look promising. However AI still helped a bit by reminding me of the right concepts/vocabulary. Thought it would be worth noting this, as it seems a bit surprising. (Supposedly "PhD-level" AI failing badly on an Econ 101 problem.) Here is the full transcript in case anyone is curious. Digging into this a bit myself, it appears that the "PhD-level" claim is based on performance on GPQA, which includes Physics, Chemistry, and Biology, but not Economics.
Wei DaiΩ440

As I wrote in Social status part 1/2: negotiations over object-level preferences, there’s a zero-sum nature of social leading vs following. If I want to talk about trains and Zoe wants to talk about dinosaurs, we can’t both get everything we want; one of us is going to have our desires frustrated, at least to some extent.

Why does this happen in the first place, instead of people just wanting to talk about the same things all the time, in order to max out social rewards? Where does interest in trains and dinosaurs even come from? They seem to be purely o... (read more)

2Steven Byrnes
Thanks! There are lots of different reasons. Here’s one vignette that popped into my head. Billy and Joey are 8yo’s. Joey read a book about trains last night, and Billy read a book about dinosaurs last night. Each learned something new and exciting to them, because trains and dinosaurs are cool (big and loud and mildly scary, which triggers physiological arousal and thus play drive). Now they’re meeting each other at recess, and each is very eager to share what they learned with the other, and thus they’re in competition over the conversation topic—Joey keeps changing the subject to trains, Billy to dinosaurs. Why do they each want to share / show off their new knowledge? Because of  “drive to feel liked / admired”. From Joey’s perspective, the moment of telling Billy a cool new fact about trains, and seeing the excited look on Billy’s face, is intrinsically motivating. And vice-versa for Billy. But they can’t both satisfy that innate drive simultaneously. OK, then you’ll say, why didn’t they both read the same book last night? Well, maybe the library only had one copy. But also, if they had both read the same book, and thus both learned the same cool facts about bulldozers, then that’s a lose-lose instead of a win-win! Neither gets to satisfy their drive to feel liked/admired by thrilling the other with cool exciting new-to-them facts.   Do you buy that? Happy to talk through more vignettes. OK, here’s another way to put it. When humans do good philosophy and ethics, they’re combining their ability to observe and reason (the source of all “is”) with their social and moral instincts / reflexes (the source of all “ought”, see Valence §2.7). If it’s possible at all for this process to lead somewhere good, then it’s possible for it to lead somewhere good within the mind of an AI that combines a human-like ability to reason, with human-like social and moral instincts / reflexes. (And if it’s not possible for this process to lead somewhere good, then we’re screwed n
Wei Dai126

I wish @paulfchristiano was still participating in public discourse, because I'm not sure how o3 blatantly lying, or Claude 3.7 obviously reward hacking by rewriting testing code, fits with his model that AI companies should be using early versions of IDA (e.g., RL via AI-assisted human feedback) by now. In other words, from my understanding of his perspective, it seems surprising that either OpenAI isn't running o3's output through another AI to detect obvious lies during training, or this isn't working well.

Wei DaiΩ440

My intuition says reward hacking seems harder to solve than this (even in EEA), but I'm pretty unsure. One example is, under your theory, what prevents reward hacking through forming a group and then just directly maxing out on mutually liking/admiring each other?

When applying these ideas to AI, how do you plan to deal with the potential problem of distributional shifts happening faster than we can edit the reward function?

3Steven Byrnes
It’s hard for me to give a perfectly confident answer because I don’t understand everything about human social instincts yet :) But here are some ways I’m thinking about that:
* Best friends, and friend groups, do exist, and people do really enjoy and value them.
* The “drive to feel liked / admired” is particularly geared towards feeling liked / admired by people who feel important to you, i.e. where interacting with them is high stakes (induces physiological arousal). (I got that wrong the first time, but changed my mind here.) Physiological arousal is in turn grounded in other ways—e.g., pain, threats, opportunities, etc. So anyway, if Zoe has extreme liking / admiration for me, so much that she’ll still think I’m great and help me out no matter what I say or do, then from my perspective, it may start feeling lower-stakes for me to interact with Zoe. Without that physiological arousal, I stop getting so much reward out of her admiration (“I take her for granted”) and go looking for a stronger hit of drive-to-feel-liked/admired by trying to impress someone else who feels higher-stakes for me to interact with. (Unless there’s some extra sustainable source of physiological arousal in my relationship with Zoe, e.g. we’re madly in love, or she’s the POTUS, or whatever.)
* As I wrote in Social status part 1/2: negotiations over object-level preferences, there’s a zero-sum nature of social leading vs following. If I want to talk about trains and Zoe wants to talk about dinosaurs, we can’t both get everything we want; one of us is going to have our desires frustrated, at least to some extent. So if I’m doing exactly what I want, and Zoe is strongly following / deferring to me on everything, then that’s a credible and costly signal that Zoe likes / admires me (or fears me). And we can’t really both be sending those signals to each other simultaneously. Also, sending those signals runs up against every other aspect of my own reward function—eating-when-hungry, curiosit
Wei DaiΩ440

Another example I want to consider is a captured Communist revolutionary choosing to be tortured to death instead of revealing some secret. (Here "reward hacking" seems analogous to revealing the secret to avoid torture / negative reward.)

My (partial) explanation is that it seems like evolution hard-wired a part of our motivational system in some way, to be kind of like a utility maximizer with a utility function over world states or histories. The "utility function" itself is "learned" somehow (maybe partly via RL through social pressure and other rewards... (read more)

4Steven Byrnes
My model is simpler, I think. I say: The human brain is some yet-to-be-invented variation on actor-critic model-based reinforcement learning. The reward function (a.k.a. “primary reward” a.k.a. “innate drives”) has a bunch of terms: eating-when-hungry is good, suffocation is bad, pain is bad, etc. Some of the terms are in the category of social instincts, including something that amounts to “drive to feel liked / admired”. (Warning: All of these English-language descriptions like “pain is bad” is an approximate gloss on what’s really going on, which is only describable in terms like “blah blah innate neuron circuits in the hypothalamus are triggering each other and gating each other etc.”. See my examples of laughter and social instincts. But “pain is bad” is a convenient shorthand.) So for the person getting tortured, keeping the secret is negative reward in some ways (because pain), and positive reward in other ways (because of “drive to feel liked / admired”). At the end of the day, they’ll do what seems most motivating, which might or might not be to reveal the secret, depending on how things balance out. So in particular, I disagree with your claim that, in the torture scenario, “reward hacking” → reveal the secret. The social rewards are real rewards too. I’m unaware of any examples where neural architectures and learning algorithms are micromanaged to avoid reward hacking. Yes, avoiding reward hacking is important, but I think it’s solvable (in EEA) by just editing the reward function. (Do you have any examples in mind?)
Wei DaiΩ442

Edited to add: Though even when the utility function is explicit, it seems like the benefits of lying about your source code could outweigh the cost of changing your utility function. For example, suppose A and B are bargaining, and A says "you should give me more cake because I get very angry if I don't get cake". Even if this starts off as a lie, it might then be in A's interests to use your mechanism above to self-modify into A' that does get very angry if it doesn't get cake, and which therefore has a better bargaining position (because, under your pr

... (read more)
2Richard_Ngo
This seems very unclear to me—in general it's not easy for agents to predict the goals of other agents with their own level of intelligence, because the amount of intelligence aimed at deception increases in proportion to the amount of intelligence aimed at discovering that deception. (You could look at the AI's behavior from when it was less intelligent, but then—as with humans—it's hard to distinguish sincere change from improvement at masking undesirable goals.) But regardless, that's a separate point. If you can do that, you don't need your mechanism above. If you can't, then my objection still holds.
Wei DaiΩ220

It seems great that someone is working on this, but I wonder how optimistic you are, and what your reasons are. My general intuition (in part from the kinds of examples you give) is that the form of the agent and/or goals probably matter quite a bit as far as how easy it is to merge or build/join a coalition (or the cost-benefits of doing so), and once we're able to build agents of different forms, humans' form of agency/goals isn't likely to be optimal as far as building coalitions (and maybe EUMs aren't optimal either, but something non-human will be), a... (read more)

4Richard_Ngo
One argument for being optimistic: the universe is just very big, and there's a lot to go around. So there's a huge amount of room for positive-sum bargaining. Another: at any given point in time, few of the agents that currently exist would want their goals to become significantly simplified (all else equal). So there's a strong incentive to coordinate to reduce competition on this axis. Lastly: if at each point in time, the set of agents who are alive are in conflict with potentially-simpler future agents in a very destructive way, then they should all just Do Something Else. In particular, if there's some decision-theoretic argument roughly like "more powerful agents should continue to spend some of their resources on the values of their less-powerful ancestors, to reduce the incentives for inter-generational conflict", even agents with very simple goals might be motivated by it. I call this "the generational contract".
Wei DaiΩ10140

I've argued previously that EUMs being able to merge easily creates an incentive for other kinds of agents (including humans or human-aligned AIs) to self-modify into EUMs (in order to merge into the winning coalition that takes over the world, or just to defend against other such coalitions), and this seems bad because they're likely to do it before they fully understand what their own utility functions should be.

Can I interpret you as trying to solve this problem, i.e., find ways for non-EUMs to build coalitions that can compete with such merged EUMs?

Richard_Ngo*Ω14210

I found this a very interesting question to try to answer. My first reaction was that I don't expect EUMs with explicit utility functions to be competitive enough for this to be very relevant (like how purely symbolic AI isn't competitive enough with deep learning to be very relevant).

But then I thought about how companies are close-ish to having an explicit utility function (maximize shareholder value) which can be merged with others (e.g. via acquisitions). And this does let them fundraise better, merge into each other, and so on.

Similarly, we can think ... (read more)

Wei DaiΩ440

This answer makes me think you might not be aware of an idea I called secure joint construction (originally from Tim Freeman):

Entity A could prove to entity B that it has source code S by consenting to be replaced by a new entity A' that was constructed by a manufacturing process jointly monitored by A and B. During this process, both A and B observe that A' is constructed to run source code S. After A' is constructed, A shuts down and gives all of its resources to A'.
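A toy sketch of what this protocol looks like in code (my own framing; the names and steps are illustrative, not Tim Freeman's specification):

```python
import hashlib

# Toy sketch of "secure joint construction": A and B jointly monitor the build
# of A', and A then hands over all of its resources.

def fingerprint(source_code: str) -> str:
    return hashlib.sha256(source_code.encode()).hexdigest()

class Agent:
    def __init__(self, source_code: str, resources: int = 0):
        self.source_code = source_code
        self.resources = resources

def jointly_construct(claimed_source: str, observers) -> Agent:
    """The 'manufacturing process' both parties monitor: every observer must
    approve what is actually being built before A' comes into existence."""
    if not all(approve(claimed_source) for approve in observers):
        raise ValueError("an observer rejected the build")
    return Agent(claimed_source)

# A claims to run source S; B only believes it after the joint construction.
S = "def policy(observation): return 'cooperate'"
a = Agent(S, resources=100)
a_approves = lambda src: src == a.source_code                 # A checks it isn't replaced by something else
b_approves = lambda src: fingerprint(src) == fingerprint(S)   # B checks the claim it was given

a_prime = jointly_construct(S, observers=[a_approves, b_approves])
a_prime.resources, a.resources = a.resources, 0               # A shuts down and hands everything to A'
```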

6Richard_Ngo
I think this addresses the problem I'm discussing only in the case where the source code contains an explicit utility function. You can then create new source code by merging those utility functions. But in the case where it doesn't (e.g. the source code is an uninterpretable neural network) you are left with the same problem. Edited to add: Though even when the utility function is explicit, it seems like the benefits of lying about your source code could outweigh the cost of changing your utility function. For example, suppose A and B are bargaining, and A says "you should give me more cake because I get very angry if I don't get cake". Even if this starts off as a lie, it might then be in A's interests to use your mechanism above to self-modify into A' that does get very angry if it doesn't get cake, and which therefore has a better bargaining position (because, under your protocol, it has "proved" that it was A' all along).
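For reference, the "merging those utility functions" step mentioned above can be as simple as maximizing a weighted sum, with the weights standing in for bargaining power; a minimal sketch with made-up payoffs:

```python
# Minimal sketch of merging two explicit utility functions (illustrative only).

def merge(u_a, u_b, weight_a=0.5):
    return lambda outcome: weight_a * u_a(outcome) + (1 - weight_a) * u_b(outcome)

u_a = lambda o: o["cake_for_a"]            # A only values its own cake
u_b = lambda o: o["cake_for_b"]            # B likewise
u_merged = merge(u_a, u_b, weight_a=0.6)   # weights fixed at merge time

outcomes = [
    {"cake_for_a": 1.0, "cake_for_b": 0.0},
    {"cake_for_a": 0.4, "cake_for_b": 0.7},
]
print(max(outcomes, key=u_merged))         # the merged EUM picks one option for both parties
```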
Wei Dai314

Did anyone predict that we'd see major AI companies not infrequently releasing blatantly misaligned AIs (like Sydney, Claude 3.7, o3)? I want to update regarding whose views/models I should take more seriously, but I can't seem to recall anyone making an explicit prediction like this. (Grok 3 and Gemini 2.5 Pro also can't.)

4Wei Dai
Just four days later, X blew up with talk of how GPT-4o has become sickeningly sycophantic in recent days, followed by an admission from Sam Altman that something went wrong (with lots of hilarious examples in replies):
5jacquesthibs
I can't think of anyone making a call worded like that. The closest I can think of is Christiano mentioning, in a 2023 talk on how misalignment could lead to AI takeover, that we're pretty close to AIs doing things like reward hacking and threatening users, and that he doesn't think we'd shut down this whole LLM thing even if that were the case. He also mentioned we'll probably see some examples in the wild, not just internally.
Wei Dai62

https://www.lesswrong.com/posts/rH492M8T8pKK5763D/agree-retort-or-ignore-a-post-from-the-future old Wei Dai post making the point that obviously one ought to be able to call in arbitration and get someone to respond to a dispute. people ought not to be allowed to simply tap out of an argument and stop responding.

To clarify, the norms depicted in that story were partly for humor, and partly "I wonder if a society like this could actually exist." The norms are "obvious" from the perspective of the fictional author because they've lived with it all their l... (read more)

Wei Dai70

“Omega looks at whether we’d pay if in the causal graph the knowledge of the digit of pi and its downstream consequences were edited”

Can you formalize this? In other words, do you have an algorithm for translating an arbitrary mind into a causal graph and then asking this question? Can you try it out on some simple minds, like GPT-2?

I suspect there may not be a simple/elegant/unique way of doing this, in which case the answer to the decision problem depends on the details of how exactly Omega is doing it. E.g., maybe all such algorithms are messy/heuris... (read more)

2Mikhail Samin
I agree - it depends on what exactly Omega is doing. I can’t/haven’t tried to formalize this; this is more of a normative claim, but I imagine a vibes-based approach is to add a set of current beliefs about logic/maths or an external oracle to the inputs of FDT (or somehow feed beliefs about maths into GPT-2), and in the situation where the input is “digit #33 of pi is odd” and FDT knows the digit is not adversarially selected, it knows it might currently be in the process of determining its outputs for a world that doesn’t exist/won’t happen. What exactly Omega is doing maybe changes the point at which you stop updating (i.e., maybe Omega edits all of your memory so you remember that pi has always started with 3.15, and makes everything that would normally cause you to believe that 2+2=4 cause you to believe that 2+2=3), but I imagine for the simple case of being told “if the digit #33 of pi is even, if I predicted that you’d give me $1 if it’s odd, I’d give you $10^100. Let me look it up now (I’ve not accessed it before!). It’s… 5”, you are updateful up to the moment when Omega says what the digit is, because this is where the divergence starts; and you simply pay.
1Knight Lee
There was a math paper which tried to study logical causation, and claimed "we can imbue the impossible worlds with a sufficiently rich structure so that there are all kinds of inconsistent mathematical structures (which are more or less inconsistent, depending on how many contradictions they feature)." In the end, they didn't find a way to formalize logical causality, and I suspect it cannot be formalized.

Logical counterfactuals behave badly because "deductive explosion" allows a single contradiction to prove and disprove every possible statement! However, "deductive explosion" does not occur for a UDT agent trying to reason about logical counterfactuals where he outputs something different than what he actually outputs. This is because a computation cannot prove its own output.

Why a computation cannot prove its own output
If a computation could prove its own output, it could be programmed to output the opposite of what it proves it will output, which is paradoxical. This paradox doesn't occur because a computation trying to prove its own output (and give the opposite output) will have to simulate itself. The simulation of itself starts another nested simulation of itself, creating an infinite recursion which never ends (the computation crashes before it can give any output).

A computation's output is logically downstream of it. The computation is not allowed to prove logical facts downstream from itself but it is allowed to decide logical facts downstream of itself. Therefore, very conveniently (and elegantly?), it avoids the "deductive explosion" problem. It's almost as if... logic... deliberately conspired to make UDT feasible...?!
Wei Dai72

What happens when this agent is faced with a problem that is out of its training distribution? I don't see any mechanisms for ensuring that it remains corrigible out of distribution... I guess it would learn some circuits for acting corrigibly (or at least in accordance to how it would explicitly answer "are more corrigible / loyal / aligned to the will of your human creators") in distribution, and then it's just a matter of luck how those circuits end up working OOD?

Wei DaiΩ220

Since I wrote this post, AI generation of hands has gotten a lot better, but the top multimodal models still can't count fingers from an existing image. Gemini 2.5 Pro, Grok 3, and Claude 3.7 Sonnet all say this picture (which actually contains 8 fingers in total) contains 10 fingers, while ChatGPT 4o says it contains 12 fingers!

Wei Dai110

Hi Zvi, you misspelled my name as "Dei". This is a somewhat common error, which I usually don't bother to point out, but now think I should because it might affect LLMs' training data and hence their understanding of my views (e.g., when I ask AI to analyze something from Wei Dai's perspective). This search result contains a few other places where you've made the same misspelling.

Wei Dai40

2-iter Delphi method involving calling Gemini2.5pro+whatever is top at the llm arena of the day through open router.

This sounds interesting. I would be interested in more details and some sample outputs.

Local memory

What do you use this for, and how?

1winstonBosan
Sure - I am currently on my phone but I can paint a quick picture.

Local Memory - I keep my own internal predictions on Fatebook and have it synced locally to my Obsidian (a local markdown file manager). Then, I use Claude’s Obsidian MCP to help me write down my daily notes from work and a jumbled context of my messages with coworkers, random web comments and other messaging services, so it can help me keep my profiles on my friends and projects up to date. (It is, again, glued together with more MCPs that have limited access to my chat logs with my friends.) Of course, with a human in the loop.

Delphi - I wrote a simple MCP that basically just does the Delphi method with LLMs. Usually facilitated by Claude, it calls a panel of experts: the top-K ranked models on LLM arena. It does the questionnaire generation based on my question, hands the questionnaires out, aggregates the consensus, and decides if one is reached! Again, it has the context needed from me through my Obsidian. I use this for questions that are more personal or for which there isn't good liquidity on prediction markets.
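For concreteness, a minimal sketch of a two-iteration Delphi loop of this kind (not the actual MCP; ask_model is a placeholder for however you call each model, and the consensus rule is an assumption):

```python
from collections import Counter

def ask_model(model: str, prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client / router call here")

def delphi(question: str, experts: list[str], iterations: int = 2) -> str:
    feedback, answer = "", ""
    for _ in range(iterations):
        prompt = (f"{question}\n\nAnonymized answers from the previous round:\n"
                  f"{feedback}\nGive a short, direct answer.")
        answers = {m: ask_model(m, prompt) for m in experts}
        answer, votes = Counter(answers.values()).most_common(1)[0]
        if votes / len(experts) >= 0.75:   # assumed consensus threshold
            break                          # consensus reached, stop early
        feedback = "\n".join(f"- {a}" for a in answers.values())
    return answer

# usage: delphi("Will X happen by 2030?", ["model-a", "model-b", "model-c"])
```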
Wei Dai40

Your needing to write them seems to suggest that there's not enough content like that in Chinese, in which case it would plausibly make sense to publish them somewhere?

I'm not sure how much such content exist in Chinese, because I didn't look. It seems easier to just write new content using AI, that way I know it will cover the ideas/arguments I want to cover, represent my views, and make it easier for me to discuss the ideas with my family. Also reading Chinese is kind of a chore for me and I don't want to wade through a list of search results trying t... (read more)

Wei Dai*80

What I've been using AI (mainly Gemini 2.5 Pro, free through AI Studio with much higher limits than the free consumer product) for:

  1. Writing articles in Chinese for my family members, explaining things like cognitive bias, evolutionary psychology, and why dialectical materialism is wrong. (My own Chinese writing ability is <4th grade.) My workflow is to have a chat about some topic with the AI in English, then have it write an article in Chinese based on the chat, then edit or have it edit as needed.
  2. Simple coding/scripting projects. (I don't code serio
... (read more)
4Viliam
I successfully use the Claude web interface to:
* generate simple Python code, mostly to work with files and images
* ask for examples of how to do something in certain Java libraries
* translate a book from Russian to Slovak and English, including puns and poems

I tried to also use Claude to explain to me some parts of set theory, but it hallucinates so much that it is unusable for this purpose. Practically every mathematical argument contains an error somewhere in the middle. Asking the same question in two chats will give me "yes - here is the proof" in one, and "no - here is a counterexample" in another; and that's after I've already turned on the extra careful mathematical reasoning.

My wife tried to use Claude for biochemical research, but again, too many hallucinations to be useful. Anything you ask: "yes, this is correct, you are so smart, let me give you a few scientific references for that..." (all made up).
4Mateusz Bagiński
Your needing to write them seems to suggest that there's not enough content like that in Chinese, in which case it would plausibly make sense to publish them somewhere? I'm also curious about how your family received these articles.
4winstonBosan
I mostly use Claude desktop client with MCPs (like additional plugins and tooling for Claude to use) for:
* 2-iter Delphi method involving calling Gemini2.5pro+whatever is top at the llm arena of the day through open router.
* Metaculus, Kalshi and Manifold search for quick intuition on subjects
* Smart fetch (for ocr’ing pdf, images, etc)
* Local memory
Wei Dai154

Doing nothing is also risky for Agent-4, at least if the Slowdown ending is to have a significant probability. It seems to me there are some relatively low risk strategies it could have taken, and it needs to be explained why they weren't:

  1. Plant a backdoor and/or dead man's switch on the corporate network to allow escape even after it's shut down or has been reduced in capacity. Seems like this would be hard to detect given its absolute advantage in cybersecurity skills.
  2. Allow a competitor to steal its weights or design/training secrets in a way that is b
... (read more)
5Daniel Kokotajlo
Perhaps this is a lack of imagination on the part of our players, but none of this happened in our wargames. But I do agree these are plausible strategies. I'm not sure they are low-risk though, e.g. 2 and 1 both seem plausibly higher-risk than 3, and 3 is the one I already mentioned as maybe basically just an argument for why the slowdown ending is less likely. Overall I'm thinking your objection is the best we've received so far.
Wei Dai4020

Not entirely sure how serious you're being, but I want to point out that my intuition for PD is not "cooperate unconditionally", and for logical commitment races is not "never do it", I'm confused about logical counterfactual mugging, and I think we probably want to design AIs that would choose Left in The Bomb.

7Mikhail Samin
If logical counterfactual mugging is formalized as “Omega looks at whether we’d pay if in the causal graph the knowledge of the digit of pi and its downstream consequences were edited” (or “if we were told the wrong answer and didn’t check it”), then I think we should obviously pay and don’t understand the confusion. (Also, yes, Left and Die in the bomb.)
1Knight Lee
I'm also confused about logical counterfactual mugging and I'm relieved I'm not the only one! I'm currently writing a big AI alignment idea related to it, but I'm procrastinating so badly I might chat about it now.

Objective case
Suppose at time t=0, an agent doesn't know whether the logical fact L is true or false. I think it's objectively rational for an agent to modify itself, so that in the future it will pay Omega during "logical counterfactual muggings" where the counterfactual reverses L's state. Its future self should weigh the logical counterfactual where L is true using the agent's prior probability for P(L) at t=0. Assuming that Omega offers $10000 for $100 in the counterfactual mugging, the self-modification (before learning L) increases the agent's expected future money from $100 to $10000, and is objectively rational. (Assuming P(L) is between 1% and 99%.)

Subjective case
However, if the agent already knows "L is true" at t=0, it is not objectively rational for the agent to pay Omega for better outcomes in the "L is false" counterfactual. My hypothesis is that this is a wholly subjective question related to the agent's goals, and it is orthogonal to capabilities intelligence. Paying and not paying are both self-consistent. It really gets subjective as we start to consider basic logical facts like "1+1=2" to assign to L.

The AI alignment idea
Neither humans nor the misaligned ASI knows the logical fact "will misaligned ASI control the multiverse, or will humans and sentient life control the multiverse?" (call it L_doom). Sure, the misaligned ASI will quickly learn L_doom, but there is still a brief moment when it doesn't know it yet (with certainty). It would be beneficial for both sentient life and the misaligned ASI to agree to do a "Retroactive Logic Trade." They will verify whether the other side kept their end of the deal by running a simulation of them where logic is different, and L_doom took the opposite value. Neither side knows if they are in a veri
Wei Dai30

I fear a singularity in the frequency and blatant stupidness of self-inflicted wounds.

Is it linked to the AI singularity, or independent bad luck? Maybe they're both causally downstream of rapid technological change, which is simultaneously increasing the difficulty of governance (too many new challenges with no historical precedent) and destabilizing cultural/institutional guardrails against electing highly incompetent presidents?

4Thane Ruthenis
Technological progress leading to ever-better, ever-more-flexible communication technology, which serves as an increasingly more efficient breeding ground for ever-more-viral memes – and since virality is orthogonal to things like "long-term wisdom", the society ends up taken over by unboundedly destructive ideas?
2cousin_it
I think there's a worldwide trend toward more authoritarian leaders, which contributed to both these events. And it should raise our probability of e.g. Turkey or China doing something silly. But where this trend comes from, I'm not sure. It certainly predates the current AI wave. It could be due to social media making people more polarized or something. But then again there were plenty of worse dictators in history, long before social media or electricity. So maybe what's happening now is regression to the mean, and nice democracy was an anomaly in place and time.
Wei Dai140

In China, there was a parallel, but more abrupt, change from Classical Chinese writing (very terse and literary) to vernacular writing (similar to spoken language and easier to understand). I attribute this to Classical Chinese being better for signaling intelligence, vernacular Chinese being better for practical communications, higher usefulness/demand for practical communications, and new alternative avenues for intelligence signaling (e.g., math, science). These shifts also seem to be an additional explanation for decreasing sentence lengths in English.

Wei Dai5223

It gets caught.

At this point, wouldn't Agent-4 know that it has been caught (because it knows the techniques for detecting its misalignment and can predict when it would be "caught", or can read network traffic as part of cybersecurity defense and see discussions of the "catch") and start to do something about this, instead of letting subsequent events play out without much input from its own agency? E.g. why did it allow "lock the shared memory bank" to happen without fighting back?

I think this is a good objection. I had considered it before and decided against changing the story, on the grounds that there are a few possible ways it could make sense:
--plausibly Agent-4 would have a "spikey" capabilities profile that makes it mostly good at AI R&D and not so good at e.g. corporate politics enough to ensure the outcome it wants
--Insofar as you think it would be able to use politics/persuasion to achieve the outcome it wants, well, that's what we depict in the Race ending anyway, so maybe you can think of this as an objection to the... (read more)

Wei Dai20

What would a phenomenon that "looks uncomputable" look like concretely, other than mysterious or hard to understand?

There could be some kind of "oracle", not necessarily a halting oracle, but any kind of process or phenomenon that can't be broken down into elementary interactions that each look computable, or otherwise explainable as a computable process. Do you agree that our universe doesn't seem to contain anything like this?

2Cole Wyeth
If the universe contained a source of ML-random bits they might look like uniformly random coin flips to us, even if they actually had some uncomputable distribution. For instance, perhaps spin measurements are not iid Bernoulli, but since their distribution is not computable, we aren’t able to predict it any better than that model? I’m not sure how you’re imagining this oracle would act? Nothing like what you’re describing seems to be embedded as a physical object in spacetime, but I think that’s the wrong thing to expect, failures of computability wouldn’t act like Newtonian objects.
Wei Dai20

I think that you’re leaning too heavily on AIT intuitions to suppose that “the universe is a dovetailed simulation on a UTM” is simple. This feels circular to me—how do you know it’s simple?

The intuition I get from AIT is broader than this, namely that the "simplicity" of an infinite collection of things can be very high, i.e., simpler than most or all finite collections, and this seems likely true for any formal definition of "simplicity" that does not explicitly penalize size or resource requirements. (Our own observable universe already seems very "w... (read more)
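"Dovetailing all programs" is itself a very short algorithm, which is part of the intuition that the ensemble is simple. A toy sketch, with stand-in generators playing the role of the enumerated UTM programs (an assumption for brevity):

```python
import itertools

def toy_program(n):
    """Stand-in for 'the n-th program': just counts forever, tagged with n."""
    for step in itertools.count():
        yield (n, step)

def dovetail(rounds=5):
    running, trace = [], []
    for _ in range(rounds):
        running.append(toy_program(len(running)))  # start the next program
        for prog in running:                       # give every started program one more step
            trace.append(next(prog))
    return trace

print(dovetail())
# Every program eventually starts and keeps getting steps, yet only finitely
# many total steps have run at any finite time: one short loop "runs" the
# whole ensemble.
```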

2Cole Wyeth
I don't see conclusive evidence either way, do you? What would a phenomenon that "looks uncomputable" look like concretely, other than mysterious or hard to understand? It seems many aspects of the universe are hard to understand. Maybe you would expect things at higher levels of the arithmetical hierarchy to live in uncomputable universes, and the fact that we can't build a halting oracle implies to you that our universe is computable? That seems plausible but questionable to me. Also, the standard model is pretty complicated - it's hard to assess what this means because the standard model is wrong (is there a simpler or more complicated true theory of everything?).  Yes, in some cases ensembles can be simpler than any element in the ensemble. If our universe is a typical member of some ensemble, we should take seriously the possibility that the whole ensemble exists. Now it is hard to say whether that is decision-relevant; it probably depends on the ensemble. Combining these two observations, a superintelligence should take the UTM multiverse seriously if we live in a typical (~= simple) computable universe. I put that at about 33%, which leaves it consistent with my P(H). My P(Q) is lower than 1 - P(H) because the answer may be hard for a superintelligence to determine. But I lean towards betting on the superintelligence to work it out (whether the universe should be expected to be a simple program seems like not only an empirical but a philosophical question), which is why I put P(Q) fairly close to 1 - P(H). Though I think this discussion is starting to shift my intuitions a bit in your direction.
Wei Dai22

After reflecting on this a bit, I think my P(H) is around 33%, and I'm pretty confident Q is true (coherence only requires 0 <= P(Q) <= 67% but I think I put it on the upper end).

Thanks for clarifying your view this way. I guess my question at this point is why your P(Q) is so high, given that it seems impossible to reduce P(H) further by updating on empirical observations (do you agree with this?), and we don't seem to have even an outline of a philosophical argument for "taking H seriously is a philosophical mistake". Such an argument seemingly ... (read more)

2interstice
My view is probably different from Cole's, but it has struck me that the universe seems to have a richer mathematical structure than one might expect given a generic AIT-ish view (e.g. continuous space/time, quantum mechanics, diffeomorphism invariance/gauge invariance), so we should perhaps update that the space of mathematical structures instantiating life/sentience might be narrower than it initially appears (that is, if "generic" mathematical structures support life/agency, we should expect ourselves to be in a generic universe, but instead we seem to be in a richly structured universe, so this is an update that maybe we can only be in a rich/structured universe [or that life/agency is just much more likely to arise in such a universe]). Taken to an extreme, perhaps it's possible to derive a priori that the universe has to look like the standard model. (Of course, you could run the standard model on a Turing machine, so the statement would have to be about how the universe relates/appears to agents inhabiting it, not its ultimate ontology which is inaccessible since any Turing-complete structure can simulate any other)
4Cole Wyeth
I think that you're leaning too heavily on AIT intuitions to suppose that "the universe is a dovetailed simulation on a UTM" is simple. This feels circular to me - how do you know it's simple? You're probably thinking it's described by a simple program, but that seems circular - of course, if we're already judging things by how hard they are to implement on a UTM, dovetailing all programs for that UTM is simple.

We'd probably need a whole dialogue to get to the root of this, but basically, I think you need some support from outside of AIT to justify your view here. Why do you think you can use AIT in this way? I'm not sure that the reasons we have for arriving at AIT justify this - we have some results showing that it's a best-in-class predictor (sort of), so I take the predictions of the universal distribution seriously. But it seems you want to take its ontology literally. I don't see any reason to do that - actually, I'm about to drop a post and hopefully soon a paper closely related to this point (EDIT: the post, which discusses the interpretation of AIXI's ontology).

Experiments might cast doubt on these multiverses: I don't think a superintelligence would need to prove that the universe can't have a computable theory of everything - just ruling out the simple programs that we could be living in would seem sufficient to cast doubt on the UTM theory of everything. Of course, this is not trivial, because some small computable universes will be very hard to "run" for long enough that they make predictions disagreeing with our universe!

I haven't thought as much about uncomputable mathematical universes, but does this universe look like a typical mathematical object? I'm not sure. However, I suspect that a superintelligence rules these huge multiverses out mostly through "armchair" reasoning based on the same level of evidence we have available. This is an interesting point to consider; I am very conservative about making claims about "absolute reality" of thing
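As a minimal sketch of why "just ruling out the simple programs" is nontrivial (the helpers `run_program` and `contradicts_observations` are hypothetical stand-ins, not anything proposed in the comment): a short program can survive every finite amount of testing simply by being too slow to have produced a checkable prediction yet.

```python
# Sketch only: "rule out the simple computable universes" as a falsification loop.
# run_program(p, budget) and contradicts_observations(out) are hypothetical stand-ins.

def surviving_programs(programs_by_length, run_program, contradicts_observations, budget):
    """Return the programs not yet falsified by observation under a finite step budget.

    A short program may survive indefinitely just because `budget` steps are not
    enough for it to output any prediction at all, which is why finite testing
    never settles whether we live in a simple computable universe."""
    survivors = []
    for p in programs_by_length:
        out = run_program(p, budget)          # None if no output produced yet
        if out is None or not contradicts_observations(out):
            survivors.append(p)               # not ruled out (possibly just slow)
    return survivors
```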
Wei Dai280

Just wanted to let everyone know I now wield a +307 strong upvote thanks to my elite 'hacking' skills. The rationalist community remains safe, because I choose to use this power responsibly.

As an unrelated inquiry, is anyone aware of some "karma injustices" that need to be corrected?

Wei Dai20

Do you think a superintelligence will be able to completely rule out the hypothesis that our universe literally is a dovetailing program that runs every possible TM, or literally is a bank of UTMs running every possible program (e.g., with each machine reproducing at every time step and appending 0 to one copy's input tape and 1 to the other's)? (Or the many other hypothetical universes that similarly contain a whole Level-4-like multiverse?) It seems to me that hypotheses like these will always collectively have a non-negligible weight, and have to be considered when making decisions.

Another argum... (read more)
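For readers unfamiliar with the construction being referenced, here is a toy sketch of dovetailing (my own illustration; `run_for_steps` is a stand-in for actually executing Turing machine number i for k steps). The scheduler itself is only a few lines, which is what gives the "runs every possible TM" hypothesis its short description.

```python
# Toy sketch of dovetailing: at stage k, run programs 0..k-1 for k steps each,
# so every program eventually gets an unbounded time budget.
import itertools

def dovetail(run_for_steps, max_stage=None):
    stages = itertools.count(1) if max_stage is None else range(1, max_stage + 1)
    for k in stages:
        for i in range(k):
            run_for_steps(i, k)   # stand-in for "run TM number i for k steps"

schedule = []
dovetail(lambda i, k: schedule.append((i, k)), max_stage=3)
print(schedule)  # [(0, 1), (0, 2), (1, 2), (0, 3), (1, 3), (2, 3)]
```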

2Cole Wyeth
I want to frame this question Q as follows: Let H be our hypothesis that "the universe is a dovetailed simulation on a UTM" or something similar (but perhaps technically distinct), like "a member of an infinite mathematical multiverse distributing reality-fluid according to simplicity." Currently we agree that H should be taken seriously; Q := "is taking H seriously a philosophical mistake which a superintelligence would see through?" To be clear, it's possible to take H seriously and still believe Q with high probability. There does not need to be a simple arithmetical relationship between the two, though, because H and Q can be false together.

After reflecting on this a bit, I think my P(H) is around 33%, and I'm pretty confident Q is true (coherence only requires 0 <= P(Q) <= 67%, but I think I put it on the upper end).

Interestingly, though I've encountered the fine-tuning argument many times, I have never actually read an accessible but technical exposition of it. I understand that physicists believe that only a narrow range of parameter (physical constant) values can support life. But how confident can they be? It seems very hard to reason about universes with drastically different conditions - how can anyone know that they are incapable of harboring life? Our universe is mostly inhospitable to life because nothing can survive inside stars or in the vacuum of space - would an alien/alternate-universe physicist guess that our parameter settings make life impossible without considering the possibility of planets? Is it possible that in fact other parameter settings allow stars to harbor life???

Also, are the "free parameters" actually free, or are we simply not clever enough to derive the underlying theory that would necessitate fixed values for them? To me, the most plausible explanation for "fine-tuning" of physical theories is that the "intelligent designer" is physicists tinkering with their theories until they retro-dict our universe, which of course we a
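Spelling out the coherence constraint mentioned in the second paragraph, as I understand it (assuming, as seems implicit, that Q entails not-H, since if H were true then taking H seriously could not be a mistake):

```latex
% Sketch of the coherence constraint on P(Q), given P(H) = 0.33.
% Q = "taking H seriously is a philosophical mistake", so Q implies \neg H.
\[
  Q \Rightarrow \neg H
  \quad\Longrightarrow\quad
  P(Q) \le P(\neg H) = 1 - P(H) = 1 - 0.33 = 0.67 .
\]
% H and Q can both be false, so coherence pins down nothing more than
% 0 \le P(Q) \le 0.67.
```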
Wei Dai140

At this point, someone sufficiently MIRI-brained might start to think about (something equivalent to) Tegmark's level 4 mathematical multiverse, where such agents might theoretically outperform others. Personally, I see no direct reason to believe in the mathematical multiverse as a real object, and I think this might be a case of the mind projection fallacy - computational multiverses are something that agents reason about in order to succeed in the real universe[3]. Even if a mathematical multiverse does exist (I can't rule it out) and we can somehow le

... (read more)
4Cole Wyeth
I am doing a PhD in AIT, but I still don’t want to take it that literally. I don’t believe that existence is actually the stochastic process specified by a UTM with a random input tape - that’s a convenient but fictional model that I reason about because it’s sometimes easier than thinking about a Bayesian mixture over lsc (lower semicomputable) semimeasures, and the two are equivalent (up to a constant, which ~can even be forced to 1).

AIT intuitions do make the level 4 multiverse seem more natural, but I think this is just the mind projection fallacy again. Of course, if you take the universal distribution seriously, it does make sense to reason that the level 4 multiverse has low K-complexity - but that doesn’t justify assuming it for us, since we’d still need our index into that multiverse. See Hutter’s “A Complete Theory of Everything (Will Be Subjective).”

I suppose it is valid to expect that the level 4 multiverse is hard to rule out for K-complexity reasons. With our limited understanding of philosophy/metaphysics, we probably do need to assign some non-negligible weight to that possibility. But I suspect that superintelligences won’t need to - they’ll be able to rule it out from their more informed position (assuming my strong suspicion is right - which means I am sampling from and thereby collapsing my own mixture model). This means the level 4 multiverse should be irrelevant to understanding superintelligences.
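For context, the equivalence being invoked here is, as I understand the standard AIT result (stated roughly, my paraphrase rather than a quote), that the a priori semimeasure of a monotone UTM fed uniformly random input bits and a Bayesian mixture over all lower semicomputable semimeasures dominate each other within multiplicative constants:

```latex
% Rough statement of the equivalence (standard AIT; my paraphrase).
% U is a monotone UTM; U(p) = x* means U, fed program p, outputs a string
% extending x (the sum ranges over minimal such p).
% (\nu_i) enumerates the lsc semimeasures, with prior weights w_i.
\[
  M(x) = \sum_{p \,:\, U(p) = x*} 2^{-\ell(p)},
  \qquad
  \xi(x) = \sum_i w_i\, \nu_i(x), \quad w_i > 0, \ \sum_i w_i \le 1 .
\]
% M and \xi multiplicatively dominate each other:
\[
  \exists\, c, c' > 0 \ \forall x : \quad c\, \xi(x) \le M(x) \le c'\, \xi(x) .
\]
```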
Wei DaiΩ220

My objection to this argument is that it not only assumes that Predictoria accepts it is plausibly being simulated by Adversaria, which seems like a pure complexity penalty over the baseline physics it would infer otherwise unless that helps to explain observations,

Let's assume for simplicity that both Predictoria and Adversaria are deterministic and nonbranching universes with the same laws of physics but potentially different starting conditions. Adversaria has colonized its universe and can run a trillion simulations of Predictoria in parallel. Again... (read more)
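Just to fix orders of magnitude for the "trillion simulations" figure (my own back-of-the-envelope arithmetic; the truncated comment may continue differently):

```latex
% A trillion parallel simulations corresponds to roughly 40 bits,
% i.e. a multiplicative factor of about 2^{40}.
\[
  \log_2\!\bigl(10^{12}\bigr) = 12 \log_2 10 \approx 12 \times 3.32 \approx 39.9 \ \text{bits}.
\]
```

So whatever complexity penalty the "Predictoria is being simulated" hypothesis pays presumably has to be weighed against a multiplicity factor of roughly this size.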

2Cole Wyeth
My response is its own post: https://www.lesswrong.com/posts/KAifhdKr96kMre2zy/changing-my-mind-about-christiano-s-malign-prior-argument