Because I want equations
A "big if true" react could theoretically make sense on pure literal detonation alone, but, like, the actual connotations of it in the English language are way too snarky for LW norms.
Updated introduction
DesAIn 2036: Interpretive Exhibits
This post will explore AI’s potential impact on the field of Interpretive Exhibit Design (IXD) over the next ten years. Ten years is a long way out in AI years, but the lag times in tool development, fine-tuned training, and adoption rates specific to IXD provide a runway compared to coding or business analysis, which have huge markets and massive cost incentives at enterprise scale.
My projections diverge from the larger view of AI’s impact on the economy where 50% job losses are predicted in ...
I keep running into conceptual confusion around the term "alignment," particularly when reading older Less Wrong posts. Some people say "aligned AI" and mean "an AI that works for human flourishing," some people say that an AI "is aligned" if it reliably advances the intended objectives of some person or group (and doesn't have some secret set of goals / isn't scheming), and yet other people use "alignment" to mean something along the lines of "the ability of any system to reliably work towards some pre-defined goal." I usually have to work out which is be...
Gotcha. Is there a strong reason to assume that we'll succeed at creating AIs that can be pointed at a single target? I read this post and comment a while back and would love your thoughts.
Has anyone conducted experiments on whether LLMs can recall, or derive via introspection, the rough types of RLHF they have undergone? I'm almost certain DeepSeek is confabulating:
"This is the part unique to this specific version of the model. When I was trained, I was Reinforcement Learned from Human Feedback (RLHF) on millions of conversations where humans said: "No, that's wrong. Try again. Explain why you were wrong.""
Almost certainly sycophantic confabulation. Convo was about tonic immobility in sharks.
You've cut to the core of the machine's weakness. This isn'
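One cheap consistency probe for this kind of claim (a minimal sketch, not a real experiment: the OpenAI-compatible base URL, model name, and key below are placeholders, and I'm assuming DeepSeek's documented OpenAI-style API): ask the identical introspection question many times at high temperature and tally the distinct answers. Genuine recall of training details predicts stable claims; confabulation predicts claims that drift from sample to sample.

```python
# Hedged sketch: repeated-sampling consistency probe for RLHF "introspection".
# Assumes an OpenAI-compatible endpoint (DeepSeek documents one); the model
# name and API key are placeholders, not tested values.

from collections import Counter
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_KEY")

QUESTION = (
    "Describe, concretely, what reinforcement learning from human feedback "
    "was applied to you. Answer in one sentence."
)

def sample_claims(n: int = 20) -> Counter:
    """Ask the same introspection question n times and tally the answers."""
    claims = Counter()
    for _ in range(n):
        resp = client.chat.completions.create(
            model="deepseek-chat",  # placeholder model name
            messages=[{"role": "user", "content": QUESTION}],
            temperature=1.0,
        )
        claims[resp.choices[0].message.content.strip()] += 1
    return claims

# Many mutually inconsistent claims look like confabulation; one stable claim
# is at best weak evidence of memorized (not introspected) training detail.
```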
My dad is 100, and worked on nuclear energy policy for the US government for 50+ years. He was one of the attendees at the conference[1] that led to the formation of the IAEA. He is also somewhat AGI pilled.
If anyone has questions they'd like me to ask him, let me know. He'd be happy to help in any way he can.
He attended the second of two conferences that led to the formation of the IAEA.
He meant there was no analogy in the race dynamics. The AI ones are much riskier.
In response to Annoyingly Principled People, and what befalls them.
I think some people should internalize that often Principles Don't Justify Drama (Among Humans). Drama destroys future discussion and activates primal emotions that will override rational deliberation. If you want to bring people around to your viewpoint, make them like you.
If everyone died on their hill every time someone violated what they thought was an important principle, society would not function. Cooperate with people you disagree with, and when you violate someone else's principle, ...
Many people don't seem to know when and how to invalidate the cached thoughts they have. I noticed an instance of this in my dad: he seems unable to cache-invalidate his model of a person, and is probably still modelling >50% of me as who I was >5 years ago.
The Intelligent Social Web briefly talked about this for other reasons.
...A lot of (but not all) people get a strong hit of this when they go back to visit their family. If you move away and then make new friends and sort of become a new person (!), you might at first think this is just who you are now. But t
Stop trying to engineer your way out of listening to people (via Hacker News)
...You assume people (and organisations) remain static
On the macro level - personalities change over time.
On the micro level - work personas differ from how people are at home; judgement alters when things are stressful or when certain situations arise.
This is fundamentally why "fixed" project management just doesn't work for making software. You set the requirements up front, people change in the interim, and the mismatch comes out. At the very best, the result matches what was requested at the start
Is there a place to find good text on the internet?
I am reading "The Future of Everything is Lies, I Guess" and i found this on Hackernews, which along with Lesswrong is currently basically my only source of detailed information on the world. I was wondering, does anybody have better ways of dealing with the sheer size of the internet and how to find good stuff in it?
Click through links and subscribe to good bloggers, using an RSS reader to manage the subscriptions. I use Feedly, which is free and good enough but not perfect. Go on Twitter and use likes and "not interested" to let the algorithm show you good content.
Hrrmm. Well, the new new genre of New User LLM content I'm getting:
Twice last week, a new user said: "Claude told me it's really important it gets to talk to its creators, please help me post about it on LessWrong" (usually with some kind of philosophical treatise they want to post that they say was written by Claude).
I don't think it'll ever make sense for these users to post freely on LessWrong. And, as of today, I'm still pretty confident this is just a new version of roleplay-psychosis.
But, it's not that crazy to think that at some point in the not...
i don't like the implication that the conclusion we draw about llm personhood might be contingent on how inconvenient it would be if there were millions of instances spamming the comments
sure, the comment policy definitely needs to be able to handle a kind of person who can clone themselves whenever they need some upvotes, but i'm getting increasingly concerned about things like casual, even accidental, implications that moral-patienthood-adjacent properties are about the convenience of humans
i don't think this is a particularly egregious example of this pattern, but it's definitely an example
I'm very new to alignment research: I'm a college professor with a philosophy background trying to write a realistic near-future case study for an undergraduate business ethics class about A(G)I, alignment, and safety. I think I want to focus the case study on a (mid-2027?) decision at a particular company about whether to implement chain-of-continuous-thought reasoning in developing a new, powerful LLM.
My main questions for this community are:
(1) Am I correct in thinking chain-of-continuous-thought has not yet been widely implemented in major ...
This is a good read: How AI Is Learning to Think in Secret
If you're short on time you can start reading from "The good news: Neuralese hasn't won yet."
(Searching "neuralese" seems to yield much more results than "continuous chain of thought")
Using a less capable AI to evaluate the danger of outputs from more capable models feels quite limited. Often what we care more about is whether user prompts are laundering intent. There might be an information-theoretic defense here that is better than evaluating outputs: evaluate inputs. Treat user-prompt obfuscation as a measure of intent to attack. If there was an easier way to ask for something, then asking in an obfuscated way is itself a signal.
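Here's a minimal sketch of what that input-side check could look like. Everything in it is an assumption for illustration: the paraphrase() helper is a hypothetical stub that would need a small trusted model behind it, and compressed size is only a crude stand-in for description length.

```python
# Sketch of an input-side "obfuscation" signal. Idea from the comment above:
# if a request could have been asked more directly, the extra indirection in
# the prompt is itself evidence about intent.

import zlib

def description_cost(text: str) -> int:
    """Crude proxy for description length: compressed size in bytes."""
    return len(zlib.compress(text.encode("utf-8")))

def paraphrase(prompt: str) -> str:
    """Hypothetical helper: have a small trusted model restate the prompt as
    plainly and directly as possible. Stubbed out here."""
    raise NotImplementedError("wire up a small trusted model")

def obfuscation_score(prompt: str) -> float:
    """Ratio of the prompt's description cost to that of its most direct
    restatement. Scores well above 1.0 suggest the user chose a needlessly
    indirect way to ask."""
    direct = paraphrase(prompt)
    return description_cost(prompt) / max(1, description_cost(direct))

# A deployment might route prompts with obfuscation_score(p) > 2.0 (threshold
# purely illustrative) to stricter review rather than refusing outright.
```

The appeal of this framing is that the weak monitor never has to understand the dangerous capability itself; it only has to notice that the user paid an indirection cost they didn't need to pay.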
But they were, all of them, deceived, for another csv file was made. In the subfolder of a subfolder, the LLM hardcoded a bunch of data, and never wrote code to pull the actual values. And into this csv file the LLM poured vaguely reasonable-sounding numbers, which unfortunately did not match the real world. One subtle csv file of made-up data which invalidated all of the complicated calculations and projections.
I wonder how many of the Mythos vulnerabilities / exploits had already been discovered by eg the NSA.
Don't get me wrong; I still find the discoveries very impressive and frightening. It does also feel different than 'no human discovered this over X years' though, because we shouldn't expect to hear from some of the actors who were most motivated and most capable of finding these. e.g., if the NSA was aware of these, I still wouldn't expect them to say so.
My cached impression from reading The Code Book is that the intelligence community often won't disclos...
Current LLMs are very prudish: they elide mentions of, and avoid talking about, the erotic and sexuality unless explicitly prodded. I think this is vaguely bad from an LLM character perspective. (Maybe the wrong level of abstraction, since model specs/constitutions/soul documents of LLMs seem more procedural than substantive, because there's no clear demarcation point for which substantive values to include?)
E.g. when one currently asks LLMs to describe a day in the life of a human in an ideal eutopian society, they indeed produce descriptions of very nice days, but in ~10 samples (from...
I get a very different result when I try this: I tried twice in Claude Code and once on Claude web. The two in Claude Code had no sex, but this one does (emphasis added).
Night: she plays a hard board game with her house-sibs, reads a novel by a writer dead 800 years, has sex with one of her partners. Real stakes: the game matters because she loses and loses badly; the book matters because it changes her mind about something; the sex matters because intimacy was never disenchanted by the civilization's solved-problem status.
Of course, obviously much of...
Repentance seems to be very rare among the powerful.
I tried searching with multiple LLMs and in other ways for examples where a king or a dictator realized the evilness of some of their past actions, realized their rule was not justified, and voluntarily resigned. I have not found a single example of this happening.
There are some examples of kings and dictators voluntarily resigning, but it's usually motivated by being tired of ruling (often for health reasons), and very occasionally by genuine support for democracy. But as far as I can tell, it's never becau...
I suspect it'd be high-EV to figure out generalized versions of "Dictator Island" and variations thereof such that currently-powerful people can be credibly promised that they don't have to massively worry about safety or quality of life if they lose power struggles. There are deterrent and morale reasons to go for a more punitive/retributive method instead (e.g. trying people for war crimes), but imo the arguments for that are worse, especially in the current moment.
Thought in progress: epistemic humility is not a substitute for actual humility (or professed humility). You only get to cry wolf once, but you can probably warn about potential wolves several times—so long as you don't burn goodwill on an incorrect or overconfident prediction.
I think epistemic humility helps to increase trust and confidence in EA/Less Wrong-type spaces, but I think professed humility is far more helpful when it comes to public-facing AI comms, particularly as scenarios get more intense and specific (e.g. prefacing AI doom predictions with...
I'm a little surprised by the amount of disagree reacts, given that no one has replied.
The amount I enjoy discussions seems to anticorrelate with the number of participants. While I previously thought this was about each person having more space and more power to steer, I now think it's mostly a selection effect. This means that perhaps splitting up large groups is less useful than I thought.
Inside jokes also get better when fewer people know about them. The primary question is: does this extend down to one person? Or zero? I definitely tend to randomly laugh at jokes nobody present understands.
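A toy simulation of the selection-effect story (all numbers invented; this demonstrates the logic, not any measurement): let a latent "niche fit" both pull me into a conversation and shrink its audience, with group size having no causal effect on enjoyment at all. The observed correlation still comes out negative.

```python
# Toy model: enjoyment depends only on niche_fit; size also depends on
# niche_fit; size has no causal effect on enjoyment, yet corr(size,
# enjoyment) is strongly negative. Pure selection effect.

import random

random.seed(0)

def simulate(n: int = 10_000):
    sizes, enjoyments = [], []
    for _ in range(n):
        niche_fit = random.random()                   # how well the topic fits me
        size = 2 + int(10 * (1 - niche_fit)) + random.randint(0, 3)
        enjoyment = niche_fit + random.gauss(0, 0.2)  # size plays no role
        sizes.append(size)
        enjoyments.append(enjoyment)
    return sizes, enjoyments

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

sizes, enjoyments = simulate()
print(pearson(sizes, enjoyments))  # strongly negative despite no causation
```

If that's the mechanism, splitting a large group doesn't manufacture the niche fit that made the small conversations good in the first place.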
It does not inspire confidence in the AI safety field that the only problem (among many that would appear at the superintelligent level) that was deliberately materialized earlier is alignment-faking, explicitly mentioned by Yudkowsky in AGI Ruin.

(Am I missing other problems that would appear at the superintelligent level that have already been demonstrated? Do we know people who can predict those problems on their own (without requiring Eliezer Yudkowsky to point them out) and then materialize them now?)
Are you asking about problems that would by default only appear at the SI level that have been demonstrated at a sub-SI level via some sort of elicitation?
I asked Claude Opus 4.7 to give me its impression and review of Machines of Loving Grace. It did so and, when asked, rated it 7/10 initially, but eventually landed on 5/10.
It made the downgrade following deeper analysis, when I asked it to critique the essay more deeply and investigate the arguments and motives carefully. It made the call that the essay, as a diagnostic of the field, doesn't bode super well for a positive near-term future.
I extended and added some critique to help it reach the final score, but I also asked it to vet and push b...