Wei Dai

LESSWRONG
LW

Wei Dai — LessWrong

Replying to Distributed vs centralized agents

How would you categorize a collection of agents that do not have a hierarchical relationship (each one is fully autonomous), but all share the same values? The main consideration that makes me think this is a likely outcome is that coordination seems a lot easier when two or more agents share the same values,^[1] because they would immediately have no reason to lie to each other or do anything else that benefits their own values at the cost of others'. This could come about either by a group of agents with different values making a bargain to all change their values into some compromise values, or by a group of agents already having... (read more)

Wei Dai 2h

When applying bold or italic, then going back to Markdown, it becomes like this (I added escape characters in MD so the special characters would render when viewing this comment):

****bold****

*_italic_*

ETA: Apparently this is actually valid MD, just redundant. Is there a reason for it?

Wei Dai 2h Quick Take

Not sure if this is already well known around here, but apparently AI companies are heavily subsidizing their subscription plans if you use their own IDEs/CLIs. (It's discussed in various places but I had to search for it.)

I realized this after trying Amp Code. They give out a $10 daily free credit, which can easily be used up in 1 or 2 prompts, e.g., "review this code base, fix any issues found". (They claim to pass their API costs to their customers with no markup, so this seems like a good proxy for actual API costs.) But with even a $19.99 subscription at one of the frontier LLM developers you can do... (read more)

Wei Dai 5h*

I think yes, given the following benefits, with the main costs being opportunity cost and the risk of losing a bunch of money in an irrational way (e.g., being unable to quit if I turned out to be a bad trader). Am I missing anything, or did you have something in mind when asking this?

- physical and psychic benefits of having greater wealth/security
- social benefits (within my immediate family who know about it, and now among LW)
- calibration about how much to trust my own judgment on various things
- it's a relatively enjoyable activity (comparable to playing computer games, which ironically I can't seem to find the motivation to play anymore)
- some small chance of eventually turning the money into a fraction of the lightcone
- evidence about whether I'm in a simulation
- some marginal increase in credibility for my ideas

Wei Dai 6h

While we're on the topic, I am kinda worried about Anthropic employees who might be talking to Claude all day and falling into a trap. (Thinking of Amanda Askell in particular, whose day job is basically this.)

I've been worried about this type of thing for a long time, but still didn't foresee or warn people that AI company employees, and specifically alignment/safety workers, could be one of the first victims (which seems really obvious in retrospect). Yet another piece of evidence for how strategically incompetent humans are.

Wei Dai 9h

This seemed like a good idea that I spent some time looking into, but ran into a roadblock. My plan was to download all the monthly statements of my accounts (I verified that they're still available, but total more than 1000 so would require some AI assistance/coding just to download/process), build a dataset of the monthly balances, then produce the final stats from the monthly total balances. But when I picked two consecutive monthly statements of a main account to look at, the account value decreased 20% from one month to the next and neither I nor the two AIs I asked (Gemini 3.0 Pro and Perplexity w/ GPT 5.2) could figure... (read more)

Replying to How do we (more) safely defer to AIs?

Wei Dai 9h

How do we (more) safely defer to AIs?

Many of my posts, e.g. Some Thoughts on Metaphilosophy, are relevant here and you may already be familiar with my ideas (or can read them if you want). But I have a new thought about this:

Given that current human experts currently disagree strongly about important aspects of the situation and what should be done, do epistemics at the level of top human experts suffice? My guess is that the answer is yes, though we may need to have the AIs ensemble together different epistemic strategies to diversify.

Current human experts seem to implicitly agree that "ensemble together different epistemic strategies to diversify" is not a good idea, since they each seem to endorse... (read more)

Wei Dai 10h

Thanks for taking a look! I've removed the large programmatically generated file (src/generated/graphql.ts) from the repository to improve auditability and reduce diff noise.

I've also added these security mitigations for the development environment:

- Dependency Audit: Integrated npm audit --audit-level=high into the build pipeline to automatically block builds with high or critical severity vulnerabilities.
- Supply Chain: Pinned vite-plugin-monkey version to mitigate potential supply chain risks.

Re: color mixing - noted! It's currently good enough for this use case, but I'll keep the gamma correction resources in mind if we need higher fidelity.

And yes I've been using different LLMs (Gemini, GPT, Claude) to review each other's code.

Please let me know if you have any other suggestions.

Wei Dai 21h

The resurrected LessWrong Power Reader is now on GitHub. Thought I'd post it here in a low-key way in case you (or anyone else) want to test it or review its API usage to make sure it's not doing something undesirable, before I make an announcement post.

the current intro and feature list

A fast, context-first reader for LessWrong and the EA Forum, designed to make high-volume reading and thread navigation feel effortless.

- Chronological Reader Core: Shows comments in strict time order with date-based pagination, so you can read without gaps.
- Deep Thread Context: Loads missing parents and replies so deep comments still make sense.
- Post + Comment Power Actions: Inline controls to expand/load post bodies, load all

... (read more)

Wei Dai 1d Quick Take

My 6 years as a trader / active investor

The Dilbert Afterlife by Scott Alexander, Jan 16, 2026:

Michael Jordan was the world’s best basketball player, and insisted on testing himself against baseball, where he failed. Herbert Hoover was one of the world’s best businessmen, and insisted on testing himself against politics, where he crashed and burned. We’re all inmates in prisons of different names. Most of us accept it and get on with our lives. Adams couldn’t stop rattling the bars.

The EMH Aten't Dead by Richard Meadows, May 15, 2020:

Which only leaves the initial claim that "at least for me this puts a final nail in the coffin of EMH."
This is a polite

... (read 676 more words →)

The striking contrast between Jan Leike, Jan 22, 2026:

Our current best overall assessment for how aligned models are is automated auditing. We prompt an auditing agent with a scenario to investigate: e.g. a dark web shopping assistant or an imminent shutdown unless humans are harmed. The auditing agent tries to get the target LLM (i.e. the production LLM we’re trying to align) to behave misaligned, and the resulting trajectory is evaluated by a separate judge LLM. Albeit very imperfect, this is the best alignment metric we have to date, and it has been quite useful in guiding our alignment mitigations work.
[...]
But the most important lesson is that simple interventions are very effective

Increasing AI Strategic Competence as a Safety Approach

Wei Dai

11d

If AIs become strategically competent enough, they may realize that RSI is too dangerous because they're not good enough at alignment or philosophy or strategy, and potentially convince, help, or work with humans to implement an AI pause. This presents an alternative "victory condition" that someone could pursue (e.g. by working on AI strategic competence) if they were relatively confident about the alignment of near-human-level AIs but concerned about the AI transition as a whole, for example because they're worried about alignment of ASI, or worried about correctly solving other philosophical problems that would arise during the transition. (But note that if the near-human-level AIs are not aligned, then this effort could... (read 287 more words →)

My attempt to resurrect the old LW Power Reader is facing an obstacle just before the finish line, due to current LW's API limitations. So this is a public appeal to the site admins/devs to relax the limit.

Specifically, my old code relied on LW1 allowing it to fetch all comments posted after a given comment ID, but I can't find anything similar in the current API. I tried reproducing this by using the allRecentComments endpoint in GraphQL, but due to the offset parameter being limited to <2000, I can't fetch comments older than a few weeks. The Power Reader is in part designed to allow someone to catch up on or skim weeks/months... (read more)
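
One way around an offset cap like the one described above, sketched here under assumptions rather than against the actual LW API, is to page by timestamp instead of by offset: the oldest comment seen so far becomes the cursor for the next request. The names `CommentStub`, `FetchPage`, and `fetchSince` are hypothetical, and whether the LW GraphQL API accepts a "posted before" filter of this kind is an assumption, not something confirmed in the comment.

```typescript
// Hypothetical shape of a comment; only the fields this sketch needs.
interface CommentStub {
  id: string;
  postedAt: string; // ISO 8601 UTC timestamp, all in the same format
}

// Hypothetical page fetcher (an assumption, not a real LW API call):
// returns up to `limit` comments posted strictly before `before`,
// newest first. A real version would wrap a network request.
type FetchPage = (before: string, limit: number) => Promise<CommentStub[]>;

// Walks backwards in time from `start` until it passes `since`.
// No offset is used, so an offset cap does not bound how far back it goes.
async function fetchSince(
  fetchPage: FetchPage,
  since: string,
  start: string,
  limit = 100,
): Promise<CommentStub[]> {
  const seen = new Set<string>();
  const out: CommentStub[] = [];
  let cursor = start;
  while (true) {
    const page = await fetchPage(cursor, limit);
    if (page.length === 0) break; // nothing older is left
    for (const c of page) {
      // Same-format ISO strings compare correctly as plain strings.
      if (c.postedAt < since) return out; // passed the requested window
      if (!seen.has(c.id)) {
        seen.add(c.id);
        out.push(c);
      }
    }
    const oldest = page[page.length - 1].postedAt;
    if (oldest >= cursor) break; // guard against a cursor that does not advance
    cursor = oldest;
  }
  return out;
}
```

Ties are the main caveat: with a strict "before" filter, comments that share the exact boundary timestamp can be skipped, so a real client might prefer a compound cursor (timestamp plus ID) if the API offers one.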

In retrospect it seems like such a fluke that decision theory in general and UDT in particular became a central concern in AI safety. In most possible worlds (with something like humans) there is probably no Eliezer-like figure, or the Eliezer-like figure isn't particularly interested in decision theory as a central part of AI safety, or doesn't like UDT in particular. I infer this from the fact that where Eliezer's influence is low (e.g. AI labs like Anthropic and OpenAI) there seems to be little interest in decision theory in connection with AI safety (cf. Dario Amodei's recent article which triggered this reflection), and in other places interested in decision theory that aren't downstream of Eliezer popularizing it, like academic philosophy, there's little interest in UDT.

If this is right, it's another piece of inexplicable personal "luck" from my perspective, i.e., why am I experiencing a rare timeline where I got this recognition/status?


Possible root causes if we don't end up having a good long term future (i.e., realize most of the potential value of the universe), with illustrative examples:

Technical incompetence
- We fail to correctly solve technical problems in AI alignment.
- We fail to build or become any kind of superintelligence.
- We fail to colonize the universe.
Philosophical incompetence
- We fail to solve philosophical problems in AI alignment.
- We end up optimizing the universe for wrong values.
Strategic incompetence
- It is not impossible to cooperate/coordinate, but we fail to figure out how.
- We fail to have other important strategic insights.
  - E.g., related to whether it's better in the long run to build AGI first, or enhance human intelligence first.
- We have the insights but fail to

... (read more)

"Utility" literally means usefulness, in other words instrumental value, but in decision theory and related fields like economics and AI alignment, it (as part of "utility function") is now associated with terminal/intrinsic value, almost the opposite thing (apparently through some quite convoluted history). Somehow this irony only occurred to me ~3 decades after learning about utility functions.

My 2003 Post on the Evolutionary Argument for AI Misalignment

Wei Dai

1mo

This was posted to SL4 on the last day of 2003. I had largely forgotten about it until I saw the LW Wiki reference it under Mesa Optimization^[1]. Besides the reward hacking angle, which is now well-trodden, it gave an argument based on the relationship between philosophy, memetics, and alignment, which has been much less discussed (including in current discussions about human fertility decline), and is perhaps still worth reading/thinking about. Overall, the post seems to have aged well, aside from the very last paragraph.

For historical context, Eliezer had coined "Friendly AI" in Creating Friendly AI 1.0 in June 2001. Although most of it was very hard to understand and subsequently disavowed by... (read 429 more words →)

A Conflict Between AI Alignment and Philosophical Competence

Wei Dai

2mo

(This argument reduces my hope that we will have AIs that are both aligned with humans in some sense and also highly philosophically competent, which aside from achieving a durable AI pause, has been my main hope for how the future turns out well. As this is a recent realization^[1], I'm still pretty uncertain how much I should update based on it, or what its full implications are.)

Being a good alignment researcher seems to require a correct understanding of the nature of values. However, metaethics is currently an unsolved problem, with all proposed solutions having flawed or inconclusive arguments, and lots of disagreement among philosophers and alignment researchers, therefore the current meta-correct... (read 485 more words →)

Relitigating the Race to Build Friendly AI

Wei Dai

2mo

Recently I've been relitigating some of my old debates with Eliezer, to right the historical wrongs. Err, I mean to improve the AI x-risk community's strategic stance. (Relevant to my recent theme of humans being bad at strategy—why didn't I do this sooner?)

Of course the most central old debate was over whether MIRI's circa 2013 plan, to build a world-altering Friendly AI^[1], was a good one. If someone were to defend it today, I imagine their main argument would be that back then, there was no way to know how hard solving Friendliness/alignment would be, so it was worth a try in case it turned out to be easy. This may seem... (read 723 more words →)

This is (approximately) my forum.

I was curious what Habryka meant when he said this. Don't non-profits usually have some kind of board oversight? It turns out (from documents filed with the State of California) that Lightcone Infrastructure, which operates LW, is what's known as a sole-member nonprofit, with a 1-3 person board of directors determined by a single person (member), namely Oliver Habryka. (Edit: My intended meaning here is that this isn't just a historical fact, but Habryka still has this unilateral power. And after some debate in the comments, it looks like this is correct after all, but was unintentional. See Habryka's clarification.)

However, it also looks like the LW domain is... (read more)

Having finally experienced the LW author moderation system firsthand by being banned from an author's posts, I want to make two arguments against it that may have been overlooked: the heavy psychological cost inflicted on a commenter like me, and a structural reason why the site admins are likely to underweight this harm and its downstream consequences.

(Edit: To prevent a possible misunderstanding, this is not meant to be a complaint about Tsvi, but about the LW system. I understand that he was just doing what he thought the LW system expected him to do. I'm actually kind of grateful to Tsvi for letting me understand viscerally what it feels like to be... (read more)


Please, Don't Roll Your Own Metaethics

Wei Dai

3mo

One day, when I was an intern at the cryptography research department of a large software company, my boss handed me an assignment to break a pseudorandom number generator passed to us for review. Someone in another department invented it and planned to use it in their product, and wanted us to take a look first. This person must have had a lot of political clout or been especially confident in himself, because he rejected the standard advice that anything an amateur comes up with is very likely to be insecure and he should instead use one of the established, off-the-shelf cryptographic algorithms that have survived extensive cryptanalysis (code breaking)... (read 506 more words →)


An update on this 2010 position of mine, which seems to have become conventional wisdom on LW:

In my posts, I've argued that indexical uncertainty like this shouldn't be represented using probabilities. Instead, I suggest that you consider yourself to be all of the many copies of you, i.e., both the ones in the ancestor simulations and the one in 2010, making decisions for all of them. Depending on your preferences, you might consider the consequences of the decisions of the copy in 2010 to be the most important and far-reaching, and therefore act mostly as if that was the only copy. [Emphasis added]

In the subsequent 15 years, I've upweighted influencing the multiverse... (read more)

Problems I've Tried to Legibilize

Wei Dai

3mo

Looking back, it appears that much of my intellectual output could be described as legibilizing work, or trying to make certain problems in AI risk more legible to myself and others. I've organized the relevant posts and comments into the following list, which can also serve as a partial guide to problems that may need to be further legibilized, especially beyond LW/rationalists, to AI researchers, funders, company leaders, government policymakers, their advisors (including future AI advisors), and the general public.

Philosophical problems
1. Probability theory
2. Decision theory
3. Beyond astronomical waste (possibility of influencing vastly larger universes beyond our own)
4. Interaction between bargaining and logical uncertainty
5. Metaethics
6. Metaphilosophy: 1, 2
Problems with specific philosophical and alignment ideas
1. Utilitarianism: 1, 2
2. Solomonoff induction
3. "Provable" safety
4. CEV
5. Corrigibility
6. IDA (and

... (read 366 more words →)


Legible vs. Illegible AI Safety Problems

Wei Dai

3mo

Some AI safety problems are legible (obvious or understandable) to company leaders and government policymakers, implying they are unlikely to deploy or allow deployment of an AI while those problems remain open (i.e., appear unsolved according to the information they have access to). But some problems are illegible (obscure or hard to understand, or in a common cognitive blind spot), meaning there is a high risk that leaders and policymakers will decide to deploy or allow deployment even if they are not solved. (Of course, this is a spectrum, but I am simplifying it to a binary for ease of exposition.)

From an x-risk perspective, working on highly legible safety problems has low... (read 420 more words →)



Trying to understand my own cognitive edge

Wei Dai

3mo

I applaud Eliezer for trying to make himself redundant, and think it's something every intellectually successful person should spend some time and effort on. I've been trying to understand my own "edge" or "moat", or cognitive traits that are responsible for whatever success I've had, in the hope of finding a way to reproduce it in others, but I'm having trouble understanding a part of it, and will try to describe my puzzle here. For context, here's an earlier EAF comment explaining my history/background and what I do understand about how my cognition differs from others.^[1]

More Background

In terms of raw intelligence, I think I'm smart but not world-class. My SAT was only 1440,... (read 1169 more words →)

Managing risks while trying to do good

Wei Dai

I often think about "the road to hell is paved with good intentions".^[1] I'm unsure to what degree this is true, but it does seem that people trying to do good have caused more negative consequences in aggregate than one might naively expect.^[2] "Power corrupts" and "power-seekers using altruism as an excuse to gain power" are two often cited reasons for this, but I think they don't explain all of it.

A more subtle reason is that even when people are genuinely trying to do good, they're not entirely aligned with goodness. Status-seeking is a powerful motivation for almost all humans, including altruists, and we frequently award social status to people for merely trying to do... (read 330 more words →)
