Some of Eliezer's founder effects on the AI alignment/x-safety field that seem detrimental and persist to this day:
I reject the idea that I'm confused at all.
Tons of people have said "Ethical realism is false", for a very long time, without needing to invent the term "meta-ethics" to describe what they were doing. They just called it ethics. Often they went beyond that and offered systems they thought it was a good idea to adopt even so, and they called that ethics, too. None of that was because anybody was confused in any way.
"Meta-ethics" lies within the traditional scope of ethics, and it's intertwined enough with the fundamental concerns of ethics that it's not rea...
I want to make a thing that talks about why people shouldn't work at Anthropic on capabilities and all the evidence that points in the direction of them being a bad actor in the space, bound by employees whom they have to deceive.
A very early version of what it might look like: https://anthropic.ml
Help needed! Email me (or DM on Signal) ms@contact.ms (@misha.09)
If your theory of change is convincing Anthropic employees or prospective Anthropic employees they should do something else, I think your current approach isn't going to work. I think you'd probably need to much more seriously engage with people who think that Anthropic is net-positive and argue against their perspective.
Possibly, you should just try to have less of a thesis and just document bad things you think Anthropic has done and ways that Anthropic/Anthropic leadership has misled employees (to appease them). This might make your output more useful i...
Over a decade ago I read this 17-year-old passage from Eliezer:
...When Marcello Herreshoff had known me for long enough, I asked him if he knew of anyone who struck him as substantially more natively intelligent than myself. Marcello thought for a moment and said "John Conway—I met him at a summer math camp." Darn, I thought, he thought of someone, and worse, it's some ultra-famous old guy I can't grab. I inquired how Marcello had arrived at the judgment. Marcello said, "He just struck me as having a tremendous amount of mental horsepow
I wonder how Eliezer would describe his "moat", i.e., what cognitive trait or combination of traits does he have, that is rarest or hardest to cultivate in others? (Would also be interested in anyone else's take on this.)
There is a recent intense interest in space-based datacenters.
I see almost no economic benefits to this in the next, say, 3 decades and see it as almost a recession indicator in itself.
However, it could subject the datacenter owners to significantly less (software) scrutiny from regulators.
Are there any economic arguments I'm missing? Could the regulator angle be the real unstated benefit behind them?
One argument is that energy costs 0.1 cents per kWh versus 5 cents on Earth. For now, launch costs dominate this, but in the future the balance might change.
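A back-of-envelope sketch of that balance in Python; only the two electricity prices come from the comment above, while the launch cost, specific power, and lifetime are numbers I'm assuming purely for illustration:

```python
# Back-of-envelope: when does cheap in-orbit energy beat Earth-side power,
# once you pay to launch the hardware? All parameters except the two
# electricity prices are assumptions chosen for illustration.

earth_price = 0.05            # $/kWh on Earth (from the comment above)
space_price = 0.001           # $/kWh marginal cost in orbit (from the comment above)
launch_cost_per_kg = 1500.0   # $/kg to orbit (assumed)
specific_power = 0.1          # kW of delivered power per kg launched (assumed)
lifetime_years = 10           # operating lifetime (assumed)

hours = lifetime_years * 365 * 24
# Energy-cost saving per kg of launched hardware over its lifetime:
saving_per_kg = specific_power * hours * (earth_price - space_price)
print(f"lifetime energy saving per kg: ${saving_per_kg:,.0f}")
print(f"launch cost per kg:            ${launch_cost_per_kg:,.0f}")
# With these assumed numbers the saving (~$430/kg) is well below the launch
# cost, which is the sense in which launch costs currently dominate.
```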
Accidental AI Safety experiment by PewDiePie: He created his own self-hosted council of 8 AIs to answer questions. They voted and picked the best answer. He noticed they were always picking the same two AIs, so he discarded the others, made the process of discarding/replacing automatic, and told the AIs about it. The AIs started talking about this "sick game" and scheming to prevent that. This is the video with the timestamp:
I don't think dealmaking will buy us much safety. This is because I expect that:
That said, I have been thinking about dealmaking because:
Computational Valence: Pain as NMI
Model valence as allostatic control in predictive agents with deadline-bound loops. Pain = Non-Maskable Interrupt: unmaskable, hard-preempt signal for survival-critical prediction errors; seizes executive control until the error is terminated/resolved. Pleasure = deferable optimization: reward logging for RL; no preemptive mandate. Implications:
Quick context: this sketch came out of a short exchange with Thomas Metzinger (Aug 21, 2025). He said, roughly, that we can’t answer the “when does synthetic phenomenology begin?” question yet, and that it “happens when the global epistemic space embeds a model of itself as a whole.” I took that as a target and tried to operationalize one path to it.
My proposal is narrower: if you build a predictive/allostatic agent with deadline-bound control loops and you add a self-model–accessing, non-maskable interrupt that can preempt any current policy to resolve su...
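To make the interrupt/deferral asymmetry concrete, here is a toy control-loop sketch; the class, threshold, and control flow are illustrative assumptions of mine, not a claim about how any existing agent works:

```python
# Toy sketch of "pain as a non-maskable interrupt" in a predictive agent loop.
# Everything here (names, threshold, control flow) is an illustrative assumption.

import random

PAIN_THRESHOLD = 0.9   # prediction-error level treated as survival-critical (assumed)

class ToyAgent:
    def __init__(self):
        self.reward_log = []          # "pleasure": deferable, logged for later optimization
        self.current_policy = "forage"

    def prediction_error(self):
        # Stand-in for an allostatic prediction-error signal.
        return random.random()

    def handle_pain(self, error):
        # Non-maskable: this path cannot be skipped or deferred by the running policy.
        # It seizes control until the error is resolved.
        while error > PAIN_THRESHOLD:
            self.current_policy = "resolve_error"
            error *= 0.5              # pretend corrective action shrinks the error

    def step(self):
        error = self.prediction_error()
        if error > PAIN_THRESHOLD:
            self.handle_pain(error)   # hard preempt, regardless of current policy
        else:
            self.reward_log.append(1.0 - error)  # deferable: just record, optimize later
            self.current_policy = "forage"

agent = ToyAgent()
for _ in range(10):
    agent.step()
print(agent.current_policy, len(agent.reward_log))
```

The point of the sketch is just the asymmetry: the pain path preempts whatever policy is running, while the pleasure path only writes to a log that later optimization may or may not consume.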
There appears to be a distaste for/disregard of AI ethics research (here mostly referring to bias and discrimination) on LW. Generally the idea is that such research misses the point, or is not focused on the correct kind of misalignment (i.e. the existential kind). I think AI ethics research is important (beyond its real-world implications), just like RL reward hacking in video game settings. In both cases we are showing that models learn unintended priorities, behaviours, and tendencies from the training process. Actually understanding how these tendencies form during training will be important for improving our understanding of SL and RL more generally.
It's interesting to read this in the context of the discussion of polarisation. Was this the first polarisation?
I'd be really interested in someone trying to answer the question: what updates on the a priori arguments about AI goal structures should we make as a result of empirical evidence that we've seen? I'd love to see a thoughtful and comprehensive discussion of this topic from someone who is both familiar with the conceptual arguments about scheming and also relevant AI safety literature (and maybe AI literature more broadly).
Maybe a good structure would be, from the a priori arguments, identifying core uncertainties like "How strong is the imitative prior?" A...
Copypasting from a slack thread:
I'll list some work that I think is aspiring to build towards an answer to some of these questions, although lots of it is very toy:
Why don’t we think about and respect the miracle of life more?
The spiders in my home continue to provide me with prompts for writing.
As I started taking a shower this morning, I noticed a small spider on the tiling. While I generally capture and release spiders from my home into the wild, this was an occasion where it was too inconvenient to: 1) stop showering, 2) dry myself, 3) put on clothes, 4) put the spider outside.
I continued my shower and watched the spider, hoping it might figure out some form of survival.
It came very close.
First it was meandering ...
I present four methods to estimate the Elo Rating for optimal play: (1) comparing optimal play to random play, (2) comparing optimal play to sensible play, (3) extrapolating Elo rating vs draw rates, (4) extrapolating Elo rating vs depth-search.
Random plays completely random legal moves. Optimal plays perfectly. Let ΔR denote the Elo gap between Random and Optimal. Random's expected score is given by E_Random = P(Random wins) + 0.5 × P(Random draws). This is related to the Elo gap via the formula E_Ran...
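For reference, the standard logistic Elo relation can be inverted to recover the gap from an expected score; a quick Python sketch (the example score at the bottom is made up, not a measured draw rate):

```python
# Standard logistic Elo relation: if Random's expected score against Optimal is E,
# then E = 1 / (1 + 10**(dR / 400)), where dR is Optimal's rating minus Random's.
# Inverting gives dR = 400 * log10(1/E - 1). The example E below is made up.
import math

def elo_gap(expected_score: float) -> float:
    """Elo gap implied by the weaker player's expected score."""
    return 400 * math.log10(1 / expected_score - 1)

# e.g. if Random scored one draw in a million games against Optimal:
E_random = 0.5 * 1e-6
print(f"implied Elo gap: {elo_gap(E_random):.0f}")   # ~2520
```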
I agree none of this is relevant to anything, I was just looking for intrinsically interesting thoughts about optimal chess.
I thought at least CDT could be approximated pretty well with a bounded variant; causal reasoning is a normal thing to do. FDT is harder, but some humans seem to find it a useful perspective, so presumably you can have algorithms meaningfully closer or further, and that is a useful proxy for something.
Actually never mind, I have no experience with the formalisms.
I guess "choose the move that maximises your expected value" is technical...
people look into universal moral frameworks like utilitarianism and EA because they lack self-confidence to take a subjective personal point of view. They need to support themselves with an "objective" system to feel confident that they are doing the correct thing. They look for external validation.
Huh. You are kind of proving my point here and you don't even seem to realize it. Alright, I will answer.
Don't guess, do research and question your bias. There is significant hard data on this. The Gates Foundation was focused on stuff like buying mosquito nets (an EA classic) and vaccinations, but somehow several of these efforts just failed. They tried to figure out why.
SJW terminology and 'vibes' are stuff intelligent people don't really bother with much where I live. We are not living in propaganda bubbles in the EU as much as in some other places.
Approach th...
Some very unhinged conversations with 4o starting at 3:40 of this video. [EDIT: fixed link]
… it started prompting more about baby me, telling me what baby me would say and do. But I kept pushing. Baby me would never say that. I just think baby me would have something more important to say.
I was a smart baby. Everyone in my family says that. Do you think I was a smart baby? Smarter than the other babies at the hospital at least?
I kept pushing, trying to see if it would affirm that I was not only the smartest baby in the hospital. Not just the smartest
epistemic status: Going out on a limb and claiming to have solved an open problem in decision theory[1] by making some strange moves. Trying to leverage Cunningham's law. Hastily written.
p(the following is a solution to Pascal's mugging in the relevant sense)≈25%[2].
Okay, setting (also here in more detail): You have a Solomonoff inductor with some universal semimeasure as a prior. The issue is that the utility of programs can grow faster than your universal semimeasure can penalize them, e.g. a complexity prior has busy-beaver-like programs that produce ...
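To spell out the divergence in my own (sketchy) notation, with universal semimeasure $M(p) \approx 2^{-K(p)}$ and program utilities $U(p)$:

$$\mathbb{E}[U] \;=\; \sum_{p} M(p)\,U(p) \;\approx\; \sum_{p} 2^{-K(p)}\,U(p),$$

and since a length-$n$ program can output utilities on the order of $\mathrm{BB}(n)$, which outgrows the $2^{-n}$ penalty, the sum need not converge.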
Does that sound right?
Can't give a confident yes because I'm pretty confused about this topic, and I'm pretty unhappy currently with the way the leverage prior mixes up action and epistemics. The issue about discounting theories of physics if they imply high leverage seems really bad? I don't understand whether the UDASSA thing fixes this. But yes.
That avoids the "how do we encode numbers" question that naturally arises.
I'm not sure how natural the encoding question is, there's probably an AIT answer to this kind of question that I don't know.
There has been a rash of highly upvoted quick takes recently that don't meet our frontpage guidelines. They are often timely, perhaps because they're political, are pitching something to the reader, or are inside baseball. These are all fine or even good things to write on LessWrong! But I (and the rest of the moderation team I talked to) still want to keep the content on the frontpage of LessWrong timeless.
Unlike posts, we don't go through each quick take and manually assign it to be frontpage or personal (and posts are treated as personal until they're actively f...
I observe that https://www.lesswrong.com/posts/BqwXYFtpetFxqkxip/mikhail-samin-s-shortform?commentId=dtmeRXPYkqfDGpaBj isn't frontpage-y but remains on the homepage even after many mods have seen it. This suggests that the mods were just patching the hack. (But I don't know what other shortforms they've hidden, besides the political ones, if any.)
Wei Dai thinks that automating philosophy is among the hardest problems in AI safety.[1] If he's right, we might face a period where we have superhuman scientific and technological progress without comparable philosophical progress. This could be dangerous: imagine humanity with the science and technology of 1960 but the philosophy of 1460!
I think the likelihood of philosophy ‘keeping pace’ with science/technology depends on two factors:
I'm curious what you say about "which are the specific problems (if any) where you specifically think 'we really need to have solved philosophy / improved-a-lot-at-metaphilosophy' to have a decent shot at solving this?'"
Assuming by "solving this" you mean solving AI x-safety or navigating the AI transition well, I just posted a draft about this. Or if you already read that and are asking for an even more concrete example, a scenario I often think about is an otherwise aligned ASI, some time into the AI transition when things are moving very fast (from a h...
Specifically, this is the privacy policy inherited from when LessWrong was a MIRI project; to the best of my knowledge, it hasn't been updated.
Mainstream belief: Rational AI agents (situationally aware, optimize decisions, etc.) are superior problem solvers, especially if they can logically motivate their reasoning.
Alternative possibility: Intuition, abstraction, and polymathic guessing will outperform rational agents in achieving competitive problem-solving outcomes. Holistic reasoning at scale will force-solve problems intractable for much more formal agents, or at least outcompete them in speed/complexity.
2)
Mainstream belief: Non-sentient machines will eventually r...
In Improving the Welfare of AIs: A Nearcasted Proposal (from 2023), I proposed talking to AIs through their internals via things like ‘think about baseball to indicate YES and soccer to indicate NO’. Based on the recent paper from Anthropic on introspection, it seems like this level of cognitive control might now be possible:
Communicating to AIs via their internals could be useful for talking about welfare/deals because the internals weren't ever trained against, potentially bypassing strong heuristics learned from training and also making it easier to con...
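As one concrete (and entirely hypothetical) picture of what reading such a signal could look like operationally, here is a minimal linear-probe sketch; the synthetic activations, the "baseball direction", and the label setup are stand-ins I made up, not the method from the paper or the original proposal:

```python
# Minimal sketch: decode a binary "YES/NO" signal from hidden activations with a
# linear probe. The activations here are synthetic stand-ins; in practice you'd
# collect activations while instructing the model to "think about baseball" (YES)
# or "soccer" (NO). Everything below is illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model = 512
yes_direction = rng.normal(size=d_model)        # pretend "baseball" direction (assumed)

def fake_activations(label: int, n: int) -> np.ndarray:
    noise = rng.normal(size=(n, d_model))
    return noise + (2.0 * yes_direction if label == 1 else 0.0)

X = np.vstack([fake_activations(1, 200), fake_activations(0, 200)])
y = np.array([1] * 200 + [0] * 200)

probe = LogisticRegression(max_iter=1000).fit(X, y)

# At "deal time", read the answer off a fresh activation:
new_act = fake_activations(1, 1)
print("decoded answer:", "YES" if probe.predict(new_act)[0] == 1 else "NO")
```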
Has anyone done any experiments into whether a model can interfere with the training of a probe (like that bit in the most recent Yudtale) by manipulating its internals?