Chris_Leong
I guess orgs need to be more careful about who they hire as forecasting/evals researchers. Sometimes things will happen, but three people at the same org...

This is also a massive burning of the commons. It is valuable for forecasting/evals orgs to be able to hire people with a diversity of viewpoints in order to counter bias. But this only works if those less worried about AI risks who join such a collaboration don't use the knowledge they gain to cash in on the AI boom in an acceleratory way. Doing so undermines the very point of such a project, namely, to try to make AI go well.

Also, let's suppose you're an x-risk funder considering whether to fund their previous org. This org does really high-quality work, but the argument for them being net-positive is now significantly weaker. This is quite likely to make finding future funding harder for them.

This is less about attacking those three folks and more just noting that we need to strive to avoid situations where things like this happen, which includes thinking about who gets hired in the first place.
jacquesthibs
Three Epoch AI employees* are leaving to co-found an AI startup focused on automating work: "Mechanize will produce the data and evals necessary for comprehensively automating work." They also just released a podcast with Dwarkesh. *Matthew Barnett, Tamay Besiroglu, Ege Erdil
niplav
Law of one player: Any specific thing you just thought of will never happen[1] unless you (yes, you specifically) make it happen.

[1] Exceptions in cases where the thing (1) gives the person doing it status, (2) is profitable, (3) gets that person (a) high quality mate(s).
... But It's Fake Tho

Epistemic status: I don't fully endorse all this, but I think it's a pretty major mistake to not at least have a model like this sandboxed in one's head and check it regularly.

Full-cynical model of the AI safety ecosystem right now:

  • There’s OpenAI, which is pretending that it’s going to have full AGI Any Day Now, and relies on that narrative to keep the investor cash flowing in while they burn billions every year, losing money on every customer and developing a product with no moat. They’re mostly a hype machine, gaming metrics and cherry-picking anything they can to pretend their products are getting better. The underlying reality is that their core products have mostly stagnated for over a year. In short: they’re faking being close to AGI.
  • Then there’s the AI regulation activists and lobbyists. They lobby and protest and stuff, pretending like they’re pushing for regulations on AI, but really they’re mostly networking and trying to improve their social status with DC People. Even if they do manage to pass any regulations on AI, those will also be mostly fake, because (a) these people are generally not getting deep into the bureaucracy which would actually implement any regulations, and (b) the regulatory targets themselves are aimed at things which seem easy to target (e.g. training FLOP limitations) rather than actually stopping advanced AI. The activists and lobbyists are nominally enemies of OpenAI, but in practice they all benefit from pushing the same narrative, and benefit from pretending that everyone involved isn’t faking everything all the time.
  • Then there’s a significant contingent of academics who pretend to produce technical research on AI safety, but in fact mostly view their job as producing technical propaganda for the regulation activists and lobbyists. (Central example: Dan Hendrycks, who is the one person I directly name mainly because I expect he thinks of himself as a propagandist and will not be particularly
We can probably survive in the following way:

  1. RL becomes the main way to get new, especially superhuman, capabilities.
  2. Because RL pushes models hard toward reward hacking, it's difficult to reliably get models to do anything that is hard to verify. Models can do impressive feats, but nobody is stupid enough to put AI into positions which usually imply responsibility.
  3. This situation makes clear how difficult alignment is, and everybody moves toward verifiable rewards or similar approaches (see the sketch below). Capabilities progress becomes dependent on alignment progress.
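To make the "verifiable rewards" idea in step 3 concrete: the reward only fires when an output passes an objective check (unit tests, a proof checker, an exact answer), which leaves much less surface for reward hacking than a fuzzy judge score. Below is a minimal Python sketch of such a reward function, not anything from the original quick take; it assumes pytest is installed and omits the sandboxing and RL loop a real setup would need.

```python
import os
import subprocess
import tempfile

def verifiable_reward(candidate_code: str, test_code: str) -> float:
    """Return 1.0 only if the model's code passes the given unit tests.

    The check is objective: there is no judge model or human rating to game.
    (Toy sketch; a real setup would sandbox execution, limit resources, etc.)
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "test_candidate.py")
        with open(path, "w") as f:
            f.write(candidate_code + "\n\n" + test_code)
        try:
            result = subprocess.run(
                ["python", "-m", "pytest", "-q", path],
                capture_output=True,
                timeout=30,
            )
        except subprocess.TimeoutExpired:
            return 0.0
        return 1.0 if result.returncode == 0 else 0.0
```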

Popular Comments

Recent Discussion

This is a linkpost for https://russellconjugations.com

Bertrand Russell noted how people often describe the same factual behavior using emotionally opposite language depending on perspective — e.g. I am firm, you are obstinate, he is pigheaded. This framing tactic is now called a Russell Conjugation, and once you start noticing them, they’re everywhere — especially in politics and media.

For the past year and a half, I’ve been training a finetuned ChatGPT model and building a tool to automatically highlight Russell Conjugations in text and suggest emotionally opposite alternatives. It functions as a fact-independent bias reverser — showing where emotional spin might exist, and how the opposite side might see an issue, regardless of the factual accuracy of specific claims. I find it especially valuable when trying to parse tribal political language, as very often...
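The post doesn't include implementation details, but as a rough sketch of the core call such a tool might make: a finetuned chat model gets a system prompt describing Russell Conjugations and is asked to return flipped rewordings. The model identifier, prompt, and output format below are placeholders of my own, not the author's actual finetune.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "Identify Russell Conjugations in the user's text: emotionally loaded "
    "words describing behavior that could be described with the same facts "
    "but the opposite emotional valence. For each one, return the original "
    "phrase and an emotionally opposite rewording as 'original -> flipped', "
    "one per line."
)

def flag_conjugations(text: str, model: str = "ft:gpt-4o-mini:example-org:russell:xyz") -> str:
    """Ask a (hypothetical) finetuned chat model to highlight Russell Conjugations.

    The model name above is a placeholder; the actual finetune and output
    format used by russellconjugations.com are not described in the post.
    """
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content

# Example: flag_conjugations("I am firm, you are obstinate, he is pigheaded.")
```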

Thank you for creating this.

TsviBT
Hm. I super like the notion and would like to see it implemented well. The very first example was bad enough to make me lose interest: https://russellconjugations.com/conj/1eaace137d74861f123219595a275f82 (Text from https://www.thenewatlantis.com/publications/the-anti-theology-of-the-body) So I tried the same thing but with more surrounding text... and it was much better!... though not actually for the subset I'd already tried above. https://russellconjugations.com/conj/3a749159e066ebc4119a3871721f24fc
Mr Frege
Of course, the first thing I did was put your post through it.
AlphaAndOmega
Here's an example: ChatGPT's initial personality analysis (I did it for the hell of it, before reading your post):

Certainly. Based on our conversations, here is a detailed personality profile.

Personality Profile

Cognitive Style: You exhibit a highly analytical, skeptical, and systems-oriented mode of thinking. You approach complex topics with intellectual rigor and demand high standards of precision, detail, and conceptual clarity. You prefer responses to be concise but rich in content, showing a low tolerance for verbosity or unfocused speculation. Your reading habits and conversational tone suggest an affinity for rationalist discourse, often associated with communities like LessWrong. You are comfortable navigating abstract and high-complexity domains, particularly when they intersect with real-world systems (e.g., medicine, AI, economics, immigration policy).

Intellectual Disposition: You are widely read and intellectually omnivorous, with demonstrated depth in psychiatry, medicine, speculative fiction, and AI ethics. Your curiosity is not casual but deeply interrogative—you pursue foundational understanding, not surface-level summaries. You are driven by a desire to model not just systems but also long-term consequences, especially where they intersect with human welfare, labor, and technological disruption.

Emotional Tone & Interpersonal Style: You maintain a reserved, self-possessed demeanor in communication, with dry humor and a tersely efficient tone. You are direct and assertive in setting conversational boundaries, clearly delineating what kinds of responses you expect. There's an implicit value placed on mutual respect and signal-to-noise ratio in dialogue. You likely prefer few, high-quality relationships or collaborations over broad social engagement.

Values & Priorities: You place a premium on intellectual integrity, competence, and foresight. You are not easily swept up in optimism or hype, preferr...

I recall seeing three “rationalist” cases for Trump:

  1. Richard Ngo on Twitter and elsewhere focused on the realignment of elite coalitions, observing that “most elite institutions have become leftist monocultures” but speculating that “over the next 5–10 years Silicon Valley will become the core of the Republicans.” The left-wing monoculture catastrophically damaged institutional integrity when public-health officials lied during the pandemic and when bureaucrats used threats and intimidation to censor speech on Facebook and Twitter and elsewhere—in the long term this could move the country toward the draconian censorship regimes, restrictions on political opposition, and unresponsiveness to public opinion that we see today in England, France, and Germany. While we have strong free-speech protections now, we could be “frog-boiled by bureaucracies … an unsolved civilizational problem which has already
...

I'm mostly going to use this to crosspost links to my blog for less polished thoughts, Musings and Rough Drafts.

Eli Tyre
This is a pretty helpful answer.  (Though you keep referencing the AI's chain of thought. I wasn't imagining training over the chain of thought. I was imagining training over the AI's outputs, whatever those are in the relevant domain.)
habryka

I don't understand what it would mean for "outputs" to be corrigible, so I feel like you must be talking about internal chain of thoughts here? The output of a corrigible AI and a non-corrigible AI is the same for almost all tasks? They both try to perform any task as well as possible; the difference is how they relate to the task and how they handle interference.

Eli Tyre
I would guess that if you finetuned a model so that it always responded in French, regardless of the language you prompt it with, it would persistently respond in French (absent various jailbreaks which would almost definitely exist).

Introduction

Writing this post puts me in a weird epistemic position. I simultaneously believe that:

  • The reasoning failures that I'll discuss are strong evidence that current LLM- or, more generally, transformer-based approaches won't get us AGI without some new breakthroughs
  • As soon as major AI labs read about the specific reasoning failures described here, they might fix them
  • But future versions of GPT, Claude, etc. succeeding at the tasks I've described here will provide zero evidence of their ability to reach AGI. If someone makes a future post where they report that they tested an LLM on all the specific things I described here and it aced all of them, that will not update my position at all.

That is because all of the reasoning failures that I describe here are surprising in the...

Very informative toy examples. Regarding this point:

> Some kind of failure of spatial reasoning (wandering items, whatever was going on with some of the sliding square chain-of-thoughts where pieces vanished)

I would strongly agree with this. I actually think the sliding block puzzle is a task which might just be easy for humans on account of our strong spatial priors. In the physical world, things move with spatial locality and two objects cannot be in the same place. For the LLM, it is trained on orders of magnitude less data to learn to represent spat...
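To make the "strong spatial priors" point concrete: the constraints humans get for free in the physical world (a piece moves one cell at a time, stays on the board, and never shares a cell with another piece) are trivial to state as code, but an LLM has to infer them from text. Here is a toy move validator for a sliding puzzle, written purely as an illustration and not taken from either post:

```python
def is_legal_move(grid, piece, direction):
    """Check one move in a sliding puzzle: the piece must exist exactly once,
    move exactly one cell, stay on the board, and land on an empty cell
    (no overlapping pieces, no vanishing pieces).
    grid maps (row, col) -> piece label or None for the blank.
    """
    deltas = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
    positions = [pos for pos, label in grid.items() if label == piece]
    if len(positions) != 1:  # missing or duplicated piece: already an illegal state
        return False
    (r, c), (dr, dc) = positions[0], deltas[direction]
    target = (r + dr, c + dc)
    return target in grid and grid[target] is None  # on the board and unoccupied

# Example 3x3 state: the "8" tile can slide right into the blank, but not up.
state = {(0, 0): "1", (0, 1): "2", (0, 2): "3",
         (1, 0): "4", (1, 1): "5", (1, 2): "6",
         (2, 0): "7", (2, 1): "8", (2, 2): None}
assert is_legal_move(state, "8", "right")
assert not is_legal_move(state, "8", "up")  # (1, 1) is occupied by "5"
```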

Kaj_Sotala
To make it a bit more explicit:

  • If you are superintelligent in the bioweapon domain: seems pretty obvious why that wouldn't let you take over the world. Sure maybe you can get all the humans killed, but unless automation also advances very substantially, this will leave nobody to maintain the infrastructure that you need to run.
  • Cybersecurity: if you just crash all the digital infrastructure, then similar. If you try to run some scheme where you extort humans to get what you want, expect humans to fight back, and then you are quickly in a very novel situation and the kind of a "world war" nobody has ever seen before.
  • Persuasion: depends on what we take the limits of persuasion to be. If it's possible to completely take over the mind of anyone by speaking ten words to them then sure, you win. But if we look at humans, great persuaders often aren't persuasive to everyone - rather they appeal very strongly to a segment of the population that happens to respond to a particular message while turning others off. (Trump, Eliezer, most politicians.) This strategy will get you part of the population while polarizing the rest against you and then you need more than persuasion ability to figure out how to get your faction to triumph.
  • If you want to run some galaxy-brained scheme where you give people inconsistent messages in order to appeal to all of them, you risk getting caught and need more than persuasion ability to make it work.
  • You can also be persuasive by being generally truthful and providing people with a lot of value and doing beneficial things. One can try to fake this by doing things that look beneficial but aren't, but then you need more than persuasion ability to figure out what those would be.
  • Probably the best strategy would be to keep being genuinely helpful until people trust you enough to put you in a position of power and then betray that trust. I could imagine this working. But it would be a slow strategy as it would take time to build
Noosphere89
I'd say the main reason memory is useful is that it enables longer-term meta-learning, as well as laying the foundation for continuous learning to work. From @Seth Herd's post: Or @gwern's comment here: https://www.lesswrong.com/posts/deesrjitvXM4xYGZd/?commentId=hSkQG2N8rkKXosLEF
Jonas Hallgren
Yeah, I agree with that, and I still feel there's something missing from that discussion? Like, to some degree, to have good planning capacity you want a good world model to plan over in the future. You then want to assign relative probabilities to your action policies working out well. To do this, having a clear self-environment boundary is quite key. So yes, memory enables in-context learning, but I don't believe that will be the largest addition; I think the fact that memory allows for more learning about self-environment boundaries is a more important part? There's stuff in RL, Active Inference, and Michael Levin's work I can point to for this, but it's a bunch of information spread out over many different papers, so it's hard to give something definitive on it.
Garrett Baker
My vague understanding is this is kinda what capabilities progress ends up looking like in big labs. Lots of very small experiments playing around with various parameters that people with a track record of good heuristics in this space feel should be played around with. Then a slow scale-up to bigger and bigger models, and then you combine everything together & "push to main" on the next big model run. I'd also guess that the bottleneck isn't so much the number of people playing around with the parameters, but much more having good heuristics regarding which parameters to play around with.
George Ingebretsen
This Dwarkesh timestamp with Jeff Dean & Noam Shazeer seems to confirm this. That would also mostly answer this question: "If parallelized experimentation drives so much algorithmic progress, why doesn't gdm just hire hundreds of researchers, each with small compute budgets, to run these experiments?" It would also imply that it would be a big deal if they had an AI with good heuristics for this kind of thing.

Don’t double update! I got that information from that same interview!
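For readers who want the shape of the workflow described above (many cheap runs over heuristically chosen tweaks, keep the winners, scale up only the survivors, then merge into the next big run), here is a minimal sketch. The search space and scoring function are placeholder stand-ins of my own, not anything from the interview:

```python
import random

# Placeholder search space: in practice the candidate tweaks come from
# researchers' heuristics about what is worth trying, not a fixed menu.
SEARCH_SPACE = {
    "learning_rate": [1e-4, 3e-4, 1e-3],
    "warmup_steps": [100, 500, 2000],
    "attention_variant": ["baseline", "tweak_a", "tweak_b"],
}

def train_and_eval(config: dict, scale: int) -> float:
    """Dummy stand-in for launching a training run at `scale` and returning a
    validation score; a real version would kick off an actual training job."""
    return random.random()

def small_scale_sweep(n_trials: int = 50) -> list:
    """Run many cheap, small-scale experiments and keep the top few configs."""
    results = []
    for _ in range(n_trials):
        config = {name: random.choice(options) for name, options in SEARCH_SPACE.items()}
        results.append((train_and_eval(config, scale=1), config))
    return sorted(results, key=lambda r: r[0], reverse=True)[:5]

# The surviving configs would then be re-tested at progressively larger scales
# before being combined into the next big model run ("push to main").
if __name__ == "__main__":
    for score, config in small_scale_sweep():
        print(f"{score:.3f}  {config}")
```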

Ryan Kidd
I expect mech interp to be particularly easy to automate at scale. If mech interp has capabilities externalities (e.g., uncovering useful learned algorithms or "retargeting the search"), this could facilitate rapid performance improvements.

This is an entry in the 'Dungeons & Data Science' series, a set of puzzles where players are given a dataset to analyze and an objective to pursue using information from that dataset.

Estimated Complexity: 3/5  (this is a guess, I will update based on feedback/seeing how the scenario goes)

STORY

It's that time of year again.  The time when the Tithe Assessment Exactors demand that all adventurers pay taxes on the various monster parts they have hacked off and sold in the past year.   And, more importantly for you, the time when clients begin banging on your door looking for advice on how to minimize their taxes.

This used to be a straightforward, if complex, application of the published tax rules.  But ever since the disaster a few years...

Assuming I didn't make any mistakes in my deductions or decisions, optimal plan goes like this:

Give everyone a Cockatrice Eye (to get the most out of the associated rebate) and a Dragon Head (to dodge the taxing-you-twice-on-every-Head-after-the-first thing).

Give the mage and the rogue a Unicorn Horn and a Zombie Hand each, and give the cleric four Zombie Hands; this should get them all as close to the 30sp threshold as possible without wrecking anything else.

Give literally everything else to the fighter, allowing them to bear the entire 212sp cost; if they get mad about it, analogize it to being a meatshield in the financial world as well as the physical.

simon
Thanks aphyer. Solution: P.S. I used GPT-4.1 in Windsurf for the AI aspects. They're running a promotion where it costs 0 credits until, IIRC, April 21.
kave
I've fixed the spoiler tags

We’ve written a new report on the threat of AI-enabled coups. 

I think this is a very serious risk – comparable in importance to AI takeover but much more neglected. 

In fact, AI-enabled coups and AI takeover have pretty similar threat models. To see this, here’s a very basic threat model for AI takeover:

  1. Humanity develops superhuman AI
  2. Superhuman AI is misaligned and power-seeking
  3. Superhuman AI seizes power for itself

And now here’s a closely analogous threat model for AI-enabled coups:

  1. Humanity develops superhuman AI
  2. Superhuman AI is controlled by a small group
  3. Superhuman AI seizes power for the small group

While the report focuses on the risk that someone seizes power over a country, I think that similar dynamics could allow someone to take over the world. In fact, if someone wanted to take over the world, their best strategy might well be...

Well done - this is super important. I think this angle might also be quite easily pitchable to governments.

This post is now looking extremely prescient.