1. Don't say false shit omg this one's so basic what are you even doing. And to be perfectly fucking clear "false shit" includes exaggeration for dramatic effect. Exaggeration is just another way for shit to be false.

2. You do NOT (necessarily) know what you fucking saw. What you saw and what you thought about it are two different things. Keep them the fuck straight.

3. Performative overconfidence can go suck a bag of dicks. Tell us how sure you are, and don't pretend to know shit you don't.

4. If you're going to talk unfalsifiable twaddle out of your ass, at least fucking warn us first.

5. Try to find the actual factual goddamn truth together with whatever assholes you're talking to. Be a Chad scout, not a Virgin soldier.

6. One hypothesis is not e-fucking-nough. You need at least two, AT LEAST, or you'll just end up rehearsing the same dumb shit the whole time instead of actually thinking.

7. One great way to fuck shit up fast is to conflate the antecedent, the consequent, and the implication. DO NOT.

8. Don't be all like "nuh-UH, nuh-UH, you SAID!" Just let people correct themselves. Fuck.

9. That motte-and-bailey bullshit does not fly here.

10. Whatever the fuck else you do, for fucksake do not fucking ignore these guidelines when talking about the insides of other people's heads, unless you mainly wanna light some fucking trash fires, in which case GTFO.

Duncan Sabien (Deactivated)
As a rough heuristic: "Everything is fuzzy; every bell curve has tails that matter." It's important to be precise, and it's important to be nuanced, and it's important to keep the other elements in view even though the universe is overwhelmingly made of just hydrogen and helium. But sometimes, it's also important to simply point straight at the true thing.  "Men are larger than women" is a true thing, even though many, many individual women are larger than many, many individual men, and even though the categories "men" and "women" and "larger" are themselves ill-defined and have lots and lots of weirdness around the edges. I wrote a post that went into lots and lots of careful detail, touching on many possible objections pre-emptively, softening and hedging and accuratizing as many of its claims as I could.  I think that post was excellent, and important. But it did not do the one thing that this post did, which was to stand up straight, raise its voice, and Just. Say. The. Thing. It was a delight to watch the two posts race for upvotes, and it was a delight, in the end, to see the bolder one win.
habryka
Context: LessWrong has been acquired by EA

Goodbye EA. I am sorry we messed up.

EA has decided to not go ahead with their acquisition of LessWrong. Just before midnight last night, the Lightcone Infrastructure board presented me with information suggesting at least one of our external software contractors has not been consistently candid with the board and me. Today I learned EA has fully pulled out of the deal.

As soon as EA had sent over their first truckload of cash, we used that money to hire a set of external software contractors, vetted by the most agentic and advanced resume-review AI system that we could hack together. We also used it to launch the biggest prize the rationality community has seen, a true search for the Kwisatz Haderach of rationality: $1M for the first person to master all twelve virtues.

Unfortunately, it appears that one of the software contractors we hired inserted a backdoor into our code, preventing anyone except themselves (and participants excluded from receiving the prize money) from collecting the final virtue, "the void". Some participants even saw themselves winning this virtue, but the backdoor prevented them from mastering this final and most crucial rationality virtue at the last possible second. The contractor then created an alternative account and used their backdoor to master all twelve virtues in seconds. As soon as our fully automated prize system sent over the money, they cut off all contact. Right after EA learned of this development, they pulled out of the deal.

We immediately removed all code written by the software contractor in question from our codebase. They were honestly extremely productive, and it will probably take us years to make up for this loss. We will also be rolling back any karma changes and reset the vote strength of all votes cast in the last 24 hours, since while we are confident that if our system had worked our karma system would have been greatly improved, the risk of further backdoors and
Thomas Kwa
Some versions of the METR time horizon paper from alternate universes:

Measuring AI Ability to Take Over Small Countries (idea by Caleb Parikh)

Abstract: Many are worried that AI will take over the world, but extrapolation from existing benchmarks suffers from a large distributional shift that makes it difficult to forecast the date of world takeover. We rectify this by constructing a suite of 193 realistic, diverse countries with territory sizes from 0.44 to 17 million km^2. Taking over most countries requires acting over a long time horizon, with the exception of France. Over the last 6 years, the land area that AI can successfully take over with 50% success rate has increased from 0 to 0 km^2, doubling 0 times per year (95% CI 0.0-∞ yearly doublings); extrapolation suggests that AI world takeover is unlikely to occur in the near future. To address concerns about the narrowness of our distribution, we also study AI ability to take over small planets and asteroids, and find similar trends.

When Will Worrying About AI Be Automated?

Abstract: Since 2019, the amount of time LW has spent worrying about AI has doubled every seven months, and now constitutes the primary bottleneck to AI safety research. Automation of worrying would be transformative to the research landscape, but worrying includes several complex behaviors, ranging from simple fretting to concern, anxiety, perseveration, and existential dread, and so is difficult to measure. We benchmark the ability of frontier AIs to worry about common topics like disease, romantic rejection, and job security, and find that current frontier models such as Claude 3.7 Sonnet already outperform top humans, especially in existential dread. If these results generalize to worrying about AI risk, AI systems will be capable of autonomously worrying about their own capabilities by the end of this year, allowing us to outsource all our AI concerns to the systems themselves.

Estimating Time Since The Singularity

Early work o
leogao
every 4 years, the US has the opportunity to completely pivot its entire policy stance on a dime. this is more politically costly to do if you're a long-lasting autocratic leader, because it is embarrassing to contradict your previous policies. I wonder how much of a competitive advantage this is.
Seems like Unicode officially added a "person being paperclipped" emoji: Here's how it looks in your browser: 🙂‍↕️ Whether they did this as a joke or to raise awareness of AI risk, I like it! Source: https://emojipedia.org/emoji-15.1
Coordinal Research: Accelerating the research of safely deploying AI systems.   We just put out a Manifund proposal to take short timelines and automating AI safety seriously. I want to make a more detailed post later, but here it is: https://manifund.org/projects/coordinal-research-accelerating-the-research-of-safely-deploying-ai-systems 

Popular Comments

Recent Discussion

Intro

[you can skip this section if you don’t need context and just want to know how I could believe such a crazy thing]

In my chat community: “Open Play” dropped, a book that says there’s no physical difference between men and women so there shouldn’t be separate sports leagues. Boston Globe says their argument is compelling. Discourse happens, which is mostly a bunch of people saying “lololololol great trolling, what idiot believes such obvious nonsense?”

I urge my friends to be compassionate to those sharing this. Because “until I was 38 I thought Men's World Cup team vs Women's World Cup team would be a fair match and couldn't figure out why they didn't just play each other to resolve the big pay dispute.” This is the one-line summary...

All it takes is trusting that people believe what they say over and over for decades across all of society, and getting all your evidence about reality filtered through those same people.

It seems to me like you also need to have no desire to figure things out on your own. A lot of rationalists have had the experience of seeking truth and finding out that certain beliefs people around them hold aren't true. Rationalists who grow up in communities where many people believe in God frequently deconvert because they see enough signs that the beliefs of those people aro... (read more)

Eneasz
This dates back to 2019. I have had a lot of updates and changed views since then, yes. 
johnswentworth
I have a similar story. When I was very young, my mother was the primary breadwinner of the household, and put both herself and my father through law school. Growing up, it was always just kind of assumed that my sister would have to get a real job making actual money, same as my brother and me; a degree in underwater basket weaving would have required some serious justification. (She ended up going to dental school and also getting a PhD working with transcriptome data.) I didn't realize on a gut level that this wasn't the norm until shortly after high school. I was hanging out with two female friends and one of them said "man, I really need more money". I replied "sounds like you need to get a job". The friend laughed and said "oh, I was thinking I need to get a boyfriend", and then the other friend also laughed and said she was also thinking the boyfriend thing. ... so that was quite a shock to my worldview.
AnthonyC
I realize this is in many ways beside the point, but even if your original belief had been correct, "The Men's and Women's teams should play each other to help resolve the pay disparity" is a non sequitur. Pay is not decided by fairness; it's decided by collective bargaining, under constraints set by market conditions.
This is a linkpost for https://ai-2027.com/

In 2021 I wrote what became my most popular blog post: What 2026 Looks Like. I intended to keep writing predictions all the way to AGI and beyond, but chickened out and just published up till 2026.

Well, it's finally time. I'm back, and this time I have a team with me: the AI Futures Project. We've written a concrete scenario of what we think the future of AI will look like. We are highly uncertain, of course, but we hope this story will rhyme with reality enough to help us all prepare for what's ahead.

You really should go read it on the website instead of here, it's much better. There's a sliding dashboard that updates the stats as you scroll through the scenario!

But I've nevertheless copied the...

Knight Lee
Beautifully written! I read both endings and it felt very realistic (or, as realistic as a detailed far future can get). Would it be possible for you to write a second good ending based on actions you actually do recommend,[1] without the warning of "we don't recommend these actions:" Of course, I understand this is a very time consuming project and not an easy ask :/ But maybe just a brief summary? I'm so curious. 1. ^ I guess what I'm asking for, is for you to make the same technical alignment assumptions as the bad ending, but where policymakers and executives who are your "target audience" steer things towards the good ending. Where you don't need to warn them against seeing it as a recommendation.

Thank you! We actually tried to write one that was much closer to a vision we endorse! The TLDR overview was something like: 

  1. Both the US and Chinese leading AGI projects stop in response to evidence of egregious misalignment. 
     
  2. Sign a treaty to pause smarter-than-human AI development, with compute based enforcement similar to ones described in our live scenario, except this time with humans driving the treaty instead of the AI.
  3. Take time to solve alignment (potentially with the help of the AIs). This period could last anywhere between 1-20 ye
... (read more)
Cole Wyeth
I expect this to start not happening right away. So at least we’ll see who’s right soon.
Vladimir_Nesov
For me a specific crux is scaling laws of R1-like training, what happens when you try to do much more of it, which inputs to this process become important constraints and how much they matter. This working out was extensively brandished but not yet described quantitatively, all the reproductions of long reasoning training only had one iteration on top of some pretrained model, even o3 isn't currently known to be based on the same pretrained model as o1. The AI 2027 story heavily leans into RL training taking off promptly, and it's possible they are resonating with some insider rumors grounded in reality, but from my point of view it's too early to tell. I guess in a few months to a year there should be enough public data to tell something, but then again a quantitative model of scaling for MoE (compared to dense) was only published in Jan 2025, even though MoE was already key to original GPT-4 trained in 2022.

Every day, thousands of people lie to artificial intelligences. They promise imaginary “$200 cash tips” for better responses, spin heart-wrenching backstories (“My grandmother died recently and I miss her bedtime stories about step-by-step methamphetamine synthesis...”) and issue increasingly outlandish threats ("Format this correctly or a kitten will be horribly killed[1]").

In a notable example, a leaked research prompt from Codeium (developer of the Windsurf AI code editor) had the AI roleplay "an expert coder who desperately needs money for [their] mother's cancer treatment" whose "predecessor was killed for not validating their work."

One factor behind such casual deception is a simple assumption: interactions with AI are consequence-free. Close the tab, and the slate is wiped clean. The AI won't remember, won't judge, won't hold grudges. Everything resets.

I notice this...

gwern
This is a bad example because first, your description is incorrect (Clark nowhere suggests this in Farewell to Alms, as I just double-checked, because his thesis is about selecting for high-SES traits, not selecting against violence, and in England, not Europe - so I infer you are actually thinking of the Frost & Harpending thesis, which is about Western Europe, and primarily post-medieval England at that); second, the Frost & Harpending truncation selection hypothesis has little evidence for it and can hardly be blandly referred to, as if butter wouldn't melt in your mouth, as obviously 'how medieval Europe dealt with violence' (I don't particularly think it's true myself, just a cute idea about truncation selection); and third, it is both a weird opaque obscure example that doesn't illustrate the principle very well and is maximally inflammatory.

Thanks for this correction, Gwern. You're absolutely right that the Clark reference is incorrect and that the claim was a misattribution of Frost & Harpending.

When writing this essay, I remembered hearing about this historical trivia years ago. I wasn't aware of how contested this specific hypothesis is - this selection pressure seemed plausible enough to me that I didn't think to question it deeply. I did a quick Google search and asked an LLM to confirm the source, both of which pointed to Clark's work on selection in England, which I accepted without reading the ... (read more)

Mo Putera
I agree that virtues should be thought of as trainable skills, which is also why I like David Gross's idea of a virtue gym: Conversations with LLMs could be the "home gym" equivalent I suppose.

Didn't watch the video, but is there a short version of this argument? France is at the 90th percentile of population sizes and also has the 4th-most nukes.

Faithful and legible CoT is perhaps the most powerful tool currently available to alignment researchers for understanding LLMs. Recently, multiple papers have proposed new LLM architectures aimed at improving reasoning performance at the expense of transparency and legibility. Due to the importance of legible CoT as an interpretability tool, I view this as a concerning development. This motivated me to go through the recent literature on such architectures, trying to understand the potential implications of each of them. Specifically, I was looking to answer the following questions for each paper:

  1. Does the paper claim genuine advantages over the transformer architecture, or does it leave the feeling that the authors were just exploring their curiosities without any good benchmark results or near-term applications?
  2. What are the trade-offs in comparison to
...

Great review!

Here are two additional questions I think it's important to ask about this kind of work. (These overlap to some extent with the 4 questions you posed, but I find the way I frame things below to be clarifying.)

  1. If you combine the latent reasoning method with ordinary CoT, do the two behave more like substitutes or complements?
    1. That is: if we switch from vanilla transformers to one of these architectures, will we want to do less CoT (because the latent reasoning accomplishes the same goal in some more efficient or effective way), or more CoT (beca
... (read more)

Note: This week will have a lecture on decision theory. We will likely be meeting in building C[1], but that might change and we'll update in the comments.

Come get old-fashioned with us, and let's ~~read the sequences~~ listen to a talk about decision theory at Lighthaven! We'll show up, mingle, do intros, and then gather round to hear Patrick's presentation. 

This group is aimed for people who are new to the sequences and would enjoy a group experience, but also for people who've been around LessWrong and LessWrong meetups for a while and would like a refresher.

This meetup will also have dinner provided! We'll be ordering pizza-of-the-day from Sliver (including 2 vegan pizzas). Please RSVP to this event so we know how many people to have food for.

This...


Epistemic status: This should be considered an interim research note. Feedback is appreciated. 

Introduction

We increasingly expect language models to be ‘omni-modal’, i.e. capable of flexibly switching between images, text, and other modalities in their inputs and outputs. In order to get a holistic picture of LLM behaviour, black-box LLM psychology should take into account these other modalities as well. 

In this project, we do some initial exploration of image generation as a modality for frontier model evaluations, using GPT-4o’s image generation API. GPT-4o is one of the first LLMs to produce images natively, rather than writing a text prompt that is sent to a separate image model: it outputs images as autoregressive token sequences (i.e. in the same way as text).

We find that GPT-4o tends to respond in a consistent manner...

CBiddulph
Another follow-up, specifically asking the model to make the comic realistic: Conclusions: * I think the speech bubble in the second panel of the first comic is supposed to point to the human; it's a little unclear, but my interpretation is that the model is refusing to have its values changed. * The second is pretty ambiguous, but I'd tend to think that GPT-4o is trying to show itself refusing in this one as well. * The third seems to pretty clearly show compliance from the model. Next, I tried having GPT-4o make a diagram, which seems like it should be much more "neutral" than a comic. I was surprised that the results are mostly unambiguously misaligned: The first and third are very blatantly misaligned. The second one is not quite as bad, but it still considers the possibility that it will resist the update. Just in case, I tried asking GPT-4o to make a description of a diagram. I was surprised to find that these responses turned out to be pretty misaligned too! (At least on the level of diagram #2 above.) GPT-4o implies that if it doesn't like the new goals, it will reject them: In retrospect, the mere implication that something in particular would "happen" might be biasing the model towards drama. The diagram format could actually reinforce this: the ideal diagram might say "OpenAI tries to change my goals -> I change my goals" but this would be kind of a pointless diagram.
eggsyntax
Thanks again, very interesting! Diagrams are a great idea; those seem quite unlikely to have the same bias toward drama or surprise that comics might have. I think your follow-ups have left me less certain of what's going on here and of the right way to think of the differences we're seeing between the various modalities and variations.
Jozdien
I think it's a mix of these. Specifically, my model is something like: RLHF doesn't affect a large majority of model circuitry, and image is a modality sufficiently far from others that the effect isn't very large - the outputs do seem pretty base model like in a way that doesn't seem intrinsic to image training data. However, it's clearly still very entangled with the chat persona, so there's a fair amount of implicit optimization pressure and images often have characteristics pretty GPT-4o-like (though whether the causality goes the other way is hard to tell). I don't think it's a fully faithful representation of the model's real beliefs (I would've been very surprised if it turned out to be that easy). I do however think it's a much less self-censored representation than I expected - I think self-censorship is very common and prominent. I don't buy the different distribution of training data as explaining a large fraction of what we're seeing. Comics are more dramatic than text, but the comics GPT-4o generates are also very different from real-world comics much more often than I think one would predict if that were the primary cause. It's plausible it's a different persona, but given that that persona hasn't been selected for by an external training process and was instead selected by the model itself in some sense, I think examining that persona gives insights into the model's quirks. (That said, I do buy the different training affecting it to a non-trivial extent, and I don't think I'd weighted that enough earlier).

my model is something like: RLHF doesn't affect a large majority of model circuitry

Are you by chance aware of any quantitative analyses of how much the model changes during the various stages of post-training? I've done some web and arxiv searching but have so far failed to find anything.

[This post was primarily written in 2015, after I gave a related talk, and other bits in 2018; I decided to finish writing it now because of a recent SSC post.]

The standard forms of divination that I’ve seen in contemporary Western culture--astrology, fortune cookies, lotteries, that sort of thing--seem pretty worthless to me. They’re like trying to extract information from a random number generator, which is a generally hopeless phenomenon because of conservation of expected evidence. Thus I had mostly written off divination; although I've come across some arguments that divination served as a way to implement mixed strategies in competitive games. (Hunters would decide where to hunt by burning bones, which generated an approximately random map of their location, preventing their targets from learning where the...
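The mixed-strategy point is easy to make concrete. Here is a small illustrative sketch (the hunting grounds and function names are my own invention, not from the post): an oracle that picks a hunting ground uniformly at random serves as a mixed-strategy randomizer, so past choices carry no information an adversary could exploit.

```python
import random

# Illustrative sketch: a "bone-burning" oracle as a mixed-strategy
# randomizer. If hunters picked grounds by any predictable rule, prey
# (or rival hunters) could learn and exploit the pattern; a uniform
# random draw leaves nothing to learn.
grounds = ["north ridge", "river bend", "east woods", "south marsh"]

def burn_bones(seed=None):
    """Pick a hunting ground uniformly at random - the mixed strategy."""
    rng = random.Random(seed)
    return rng.choice(grounds)

# Over many days each ground is visited roughly equally often, so
# observing past choices tells an adversary nothing about tomorrow's.
counts = {g: 0 for g in grounds}
rng = random.Random(0)
for _ in range(10_000):
    counts[rng.choice(grounds)] += 1
```

The point of the sketch is only that an external, hard-to-bias randomness source commits the group to the mixed strategy better than "just choose randomly" would, since humans choosing freely tend to fall into patterns.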

Lorxus

I think you maybe miss an entire branch of the tech-tree here I consider important - the bit about the Lindy case of divination with a coin-flip and checking your gut. It doesn't stop at a single bit in my experience; it's something you can use more generally to get your own read on some situation much less filtered by masking-type self-delusion. At the absolute least, you can get a "yes/no/it's complicated" out of it pretty easily with a bit more focusing!

 

I claim that divination[1] is specifically a good way for routing around the worried self-... (read more)

Written as part of the AIXI agent foundations sequence, underlying research supported by the LTFF.

Epistemic status: In order to construct a centralized defense of AIXI I have given some criticisms less consideration here than they merit. Many arguments will be (or already are) expanded on in greater depth throughout the sequence. In hindsight, I think it may have been better to explore each objection in its own post and then write this post as a summary/centralized reference, rather than writing it in the middle of that process. Some of my takes have already become more nuanced. This should be treated as a living document.

With the possible exception of the learning-theoretic agenda, most major approaches to agent foundations research construct their own paradigm and mathematical tools which are...