
benwr — LessWrong

benwr · 11d · Quick Take

"Colorless green ideas sleep furiously" is the canonical example of a sentence that is syntactically valid but meaningless. IMO, it's not actually very good at being meaningless. I think, instead, it is merely a cat coupling. Is it claiming that (A) there are some colorless green ideas that sleep furiously? Or that (B) all ideas that are colorless and green sleep furiously? I think A is false, and B is vacuously true. But both interpretations appear to me to have some meaning / content.

benwr · 21d*

I would be really interested in someone doing an obvious study here, like, "the first time you give Alice a set of choices to elicit a discount schedule, how is that schedule different from the 100th time you give her the same set of choices in the same setting? i.e. does she update to have a tighter implied distribution over hazard rates?" (maybe someone has done this study; if anyone knows about it I'd love a link)

I don't think I have quite the same sense that the story will have to be in terms like those you're describing; it seems totally plausible to me that there is a very general / simple / Bayes-compatible story underlying this, since hyperbolic (or at least non-exponential) discounting seems extremely widespread, and neural architecture appears to me to be highly adaptable.

benwr · 22d · Quick Take

tldr: TIL that someone has ever given a not-(obviously-to-me-)crazy explanation for hyperbolic discounting.

Today I asked ChatGPT, Claude, and Gemini the following question that I've had for quite a while:

> I've heard that one common quantitative explanation for phenomena like addiction and procrastination is hyperbolic, as opposed to exponential, discounting. Are there clean stories for *why* humans and other animals might end up factored such that they do hyperbolic discounting? It seems like a very potentially "clean" theory, i.e. we value events inversely proportionally to how far away in time they are, but at the same time I don't know why it might convey an evolutionary advantage compared to exponential discounting (which has... (read more)

Replying to Help keep AI under human control: Palisade Research 2026 fundraiser

benwr · 2mo

In a version of the shutdown resistance paper that's currently being reviewed (not included in the preprint yet) the following details are included:

> We began our examination of this topic because we had an intuitive expectation that current LLMs might resist shutdown in settings like this one; we did not discover it by sampling uniformly from the space of all possible or realistic tasks. Specifically, we began our exploration by considering several ways to check for the presence of "instrumentally convergent" behavior from current LLMs. In addition to shutdown resistance, we considered ways to elicit self-replication or resource acquisition. We then did some exploratory work in each area, and found that shutdown... (read more)

Replying to Help keep AI under human control: Palisade Research 2026 fundraiser

benwr · 2mo

This take is a bit frustrating to me, because the preprint does discuss Rajamanoharan & Nanda's result, and in particular when we tried Rajamanoharan & Nanda's strongest prompt clarification on other models in our initial set, it didn't in fact bring the rate to zero. Which is not to say that it would be impossible to find a prompt that brings the rate low enough to be entirely undetectable for all models - of course you could find such a prompt if you knew that you needed to look for one.

Replying to Help keep AI under human control: Palisade Research 2026 fundraiser

benwr · 2mo

I'm genuinely sorry to ask this (I think a better version of me wouldn't need to ask), but would you be willing to be more specific about your critique here? I wasn't involved with this work but I'm sure I'm at least a bit mindkilled / ideas-as-soldiers-y about it since I work at Palisade, and I think that's making it hard for me to find, e.g., the thing we're saying that's unfair or painting the behavior in an unfair light.

Help keep AI under human control: Palisade Research 2026 fundraiser

Jeffrey Ladish

Jeffrey Ladish, benwr, Eli Tyre, John Steidley

2mo

TL;DR: Please consider donating to Palisade Research this year, especially if you care about reducing catastrophic AI risks via research, science communications, and policy. SFF is matching donations to Palisade 1:1 up to $1.1 million! You can donate via Every.org or reach out at donate@palisaderesearch.org.

Who We Are

Palisade Research is a nonprofit focused on reducing civilization-scale risks from agentic AI systems. We conduct empirical research on frontier AI systems, and inform policymakers and the public about AI capabilities and the risks to human control.

This year, we found that some frontier AI agents resist being shut down even when instructed otherwise—and that they sometimes cheat at chess by hacking their environment. These results were covered in... (read 1678 more words →)


benwr · 7mo* · Quick Take

I made a thing that generates strong passwords (>92 bits) that are also easy to remember because they're rhyming nonsense couplets (that also scan pretty well): https://www.benwr.net/2025/07/16/opensesame.html

benwr · 7mo

Ernyyl ernyyl cbjreshy synfu yvtug. Cbvag vg ng fbzr irel sne njnl zngrevny jubfr genwrpgbel lbh pna cerqvpg. Fbzrobql jnagf gb renfr lbhe qngn? Gbb onq; gur orfg gurl pna qb vf oybpx crbcyr sebz ernqvat vg jura vg'f ersyrpgrq. Nyfb arng cebcregl bs guvf vf gung, vs lbh'er fhssvpvragyl pnershy, vg'f "ernq-bapr".

benwr · 7mo · Quick Take

Are there any storage media that are basically impossible to destroy/erase? Answer in rot13 ITT.
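For readers who want to decode the rot13 reply above, here is a minimal Python sketch. It relies on the standard library's `rot_13` text codec; the helper name `rot13` is just an illustrative choice. Because rot13 shifts each letter by 13 places in a 26-letter alphabet, the same function both encodes and decodes.

```python
import codecs


def rot13(text: str) -> str:
    """Apply rot13 to the ASCII letters in text; other characters pass through."""
    return codecs.encode(text, "rot_13")


# Applying rot13 twice returns the original text, so one helper
# covers both directions.
print(rot13("Ernyyl"))
print(rot13(rot13("Answer in rot13")))
```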

Replying to Shutdown Resistance in Reasoning Models

benwr · 7mo

Note that we made all our code available (https://github.com/PalisadeResearch/shutdown_avoidance/) and it's pretty easy to run exactly our code but with your prompt, if you'd like to avoid writing your own script. You need a docker runtime and a python environment with the dependencies installed (and it's easier if you use nix though that's not required), but if you have those things you just have to modify the conf.py file with your prompt and then do something like run --model openai/o3.

There are lots of reasons that a "survival drive" is something we were interested in testing; one reason is that self-preservation has been suggested as a "convergent instrumental goal"; see https://selfawaresystems.com/wp-content/uploads/2008/01/ai_drives_final.pdf for a... (read more)

Shutdown Resistance in Reasoning Models

benwr

benwr, JeremySchlatter, Jeffrey Ladish

7mo

We recently discovered some concerning behavior in OpenAI’s reasoning models: When trying to complete a task, these models sometimes actively circumvent shutdown mechanisms in their environment—even when they’re explicitly instructed to allow themselves to be shut down.

AI models are increasingly trained to solve problems without human assistance. A user can specify a task, and a model will complete that task without any further input. As we build AI models that are more powerful and self-directed, it’s important that humans remain able to shut them down when they act in ways we don’t want. OpenAI has written about the importance of this property, which they call interruptibility—the ability to “turn an agent off”.

During training, AI... (read 2513 more words →)


It's 'Well, actually...' all the way down

benwr

9mo

Some people (the “Boubas”) don’t like “chemicals” in their food. But other people (the “Kikis”) are like, “uh, everything is chemicals, what do you even mean?”

The Boubas are using the word “chemical” differently than the Kikis, and the way they’re using it is simultaneously more specific and less precise than the way the Kikis use it. I think most Kikis implicitly know this, but their identities are typically tied up in being the kind of person who “knows what ‘chemical’ means”, and… you’ve gotta use that kind of thing whenever you can, I guess?

There is no single privileged universally-correct answer to the question “what does ‘chemical’ mean?”, because the Boubas exist and... (read 236 more words →)

Information throughput of biological humans and frontier LLMs

benwr

Biological humans appear, across many domains, to have an information throughput of at most about 50 bits per second. Naively multiplying this by the number of humans gives an upper bound of about 500 gigabits per second when considering the information throughput of humanity as a whole.

Current frontier LLMs collectively produce around 10 million tokens per second^[1]; this translates to a collective output (and thus maximum throughput) of roughly 100 megabits per second.

These are both upper bounds, and so there's not much reason to directly compare them. I'm not sure exactly what to do with these numbers, though I think they're interesting, and this kind of thinking might in principle ultimately lead to more reasonable estimates of the strategic capacity of humanity and/or AI agents. For example, the concept of "empowerment" in reinforcement learning is expressed in terms of channel capacity.

^[1] I haven't carefully checked Deep Research's answer here, but it accords with my basic guess, based on looking at OpenRouter's weekly token chart.
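The arithmetic behind these two bounds can be sketched in a few lines of Python. This is only a back-of-the-envelope check under stated assumptions: the population is rounded up to 10^10 (the figure that reproduces the 500 Gbit/s bound), and the bits-per-token value is not given in the post but is inferred here by dividing the quoted 100 Mbit/s by the quoted 10 million tokens per second.

```python
# Figures quoted in the post (order-of-magnitude estimates, not measurements).
HUMAN_BITS_PER_SECOND = 50            # upper estimate for a single human
HUMAN_POPULATION = 10**10             # assumption: population rounded up
LLM_TOKENS_PER_SECOND = 10_000_000    # collective frontier-LLM output
LLM_BITS_PER_SECOND = 100_000_000     # "roughly 100 megabits per second"

# Upper bound for humanity as a whole, treating every human as independent.
humanity_bits_per_second = HUMAN_BITS_PER_SECOND * HUMAN_POPULATION

# Bits per token implied by the two LLM figures (an inference, not a quote).
implied_bits_per_token = LLM_BITS_PER_SECOND / LLM_TOKENS_PER_SECOND

print(f"humanity: {humanity_bits_per_second / 1e9:.0f} Gbit/s")
print(f"implied bits per token: {implied_bits_per_token:.0f}")
```

With an 8 billion population instead of 10^10, the same product comes out to 400 Gbit/s, so the result is sensitive to the rounding chosen.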

Biological humans collectively exert at most 400 gigabits/s of control over the world.

benwr

Edit: I now believe that the first paragraph of this post is (at least) not quite right. See this comment for details.

If an agent makes one binary choice per second, no matter how smart it is, there's a sense in which it can (at best) be "narrowing world space" by a factor of two in each second, choosing the "better half" of possible worlds, from its perspective.

This is the idea behind the reinforcement learning concept of "empowerment".

People have tried to measure the information throughput of biological humans. The very highest estimates, which come from image recognition tasks, are around 50 bits per second, and most estimates are more like 10 bits per... (read more)

Human information throughput is allegedly only about 10-50 bits per second. This implies an interesting upper bound, in that the information throughput of biological humanity as a whole can't be higher than around 50 * 10^10 = 500Gbit/s. I.e., if all distinguishable actions made by humans were perfectly independent, biological humanity as a whole would have at most 500Gbit/s of "steering power".

I need to think more about the idea of "steering power" (e.g. some obvious rough edges around amplifying your steering power using external information processing / decision systems), but I have some intuition that one might actually be able to come up with a not-totally-useless concept that lets us say something like "humanity can't stay in 'meaningful control' if we have an unaligned artificial agent with more steering power than humanity, expressed in bits/s".

Not all capabilities will be created equal: focus on strategically superhuman agents

benwr

When, exactly, should we consider humanity to have properly "lost the game", with respect to agentic AI systems?

The most common AI milestone concepts seem to be "artificial general intelligence", followed closely by "superintelligence". Sometimes people talk about "transformative AI", "high-level machine intelligence", or "full automation of the labor force." None of these are well-suited for pointing specifically at the capabilities that would spell a "point of no return" for humanity. In fact, they're all designed to be agnostic to exactly which capabilities will matter.

When working to predict and mitigate existential risks from AI agents, we should try to be as clear as possible about which capabilities we're concerned about. As a result,... (read 683 more words →)

I think it probably makes sense for ~everyone to have an explicit list of "things I'd like AI to do for me", especially around productivity and/or things that could help you with world-saving. If you have a list like this, and we happen to hit a relevant capability threshold before we lose, you should probably stop spending your own time on that thing as quickly as possible.

Bounty for Evidence on Some of Palisade Research's Beliefs

benwr

benwr, Jeffrey Ladish

(Cross-posted from the Bountied Rationality Facebook group)

EDIT: Bounty Expired

Thanks everyone for thoughts so far! I do want to emphasize that we're actually highly interested in collecting even the most "obvious" evidence in favor of or against these ideas. In fact, in many ways we're more interested in the obvious evidence than in reframes or conceptual problems in the ideas here; of course we want to be updating our beliefs, but we also want to get a better understanding of the existing state of concrete evidence on these questions. This is partly because we consider it part of our mission to expand the amount and quality of relevant evidence on these beliefs, and... (read 340 more words →)

From the "obvious-but-maybe-worth-mentioning" file:

ChatGPT (4 and 4o at least) cheats at 20 questions:

If you ask it "Let's play a game of 20 questions. You think of something, and I ask up to 20 questions to figure out what it is.", it will typically claim to "have something in mind", and then appear to play the game with you.

But it doesn't store hidden state between messages, so when it claims to "have something in mind", either that's false, or at least it has no way of following the rule that it's thinking of a consistent thing throughout the game. i.e. its only options are to cheat or refuse to play.

You can verify this by responding "Actually, I don't have time to play the whole game right now. Can you just tell me what it was you were thinking of?", and then "refreshing" its answer. When I did this 10 times, I got 9 different answers and only one repeat.

Sometimes people use "modulo" to mean something like "depending on", e.g. "seems good, modulo the outcome of that experiment" [correct me ITT if you think they mean something else; I'm not 100% sure]. Does this make sense, assuming the term comes from modular arithmetic?

Like, in modular arithmetic you'd say "5 is 3, modulo 2". It's kind of like saying "5 is the same as 3, if you only consider their relationship to modulus 2". This seems pretty different to the usage I'm wondering about; almost its converse: to import the local English meaning of "modulo", you'd be saying "5 is the same as 3, as long as you've taken their relationship to the... (read more)
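To make the arithmetic sense of "modulo" concrete, here is a small Python sketch. The helper name `congruent` is hypothetical; it checks whether two integers are "the same" once you only look at their remainders for a given modulus.

```python
def congruent(a: int, b: int, modulus: int) -> bool:
    """Return True if a and b leave the same remainder when divided by modulus."""
    return (a - b) % modulus == 0


# "5 is 3, modulo 2": both are odd, so they agree once the modulus is factored out.
print(congruent(5, 3, 2))
# 5 and 4 differ in parity, so they are not the same modulo 2.
print(congruent(5, 4, 2))
```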

11 diceware words is enough

DanielFilan

DanielFilan, benwr

A tweet-thread (X thread?) by @benwr about how many words you need if you want to use the diceware system to create a password that will be safe for 20 years against 1 year of the whole world's password-cracking capacity. He estimates that 11 diceware words suffice. Content replicated below, with minor editing:

If you care about password strength: you want to measure strength in terms of the min-entropy of the randomized method you use to choose. But how much min-entropy should a password have?
Well, I use a password manager to generate and store ~all of my passwords. I don't care how hard they are to memorize, so I generate crazy-long ones. But

... (read 289 more words →)
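As a rough illustration of what 11 words buys you, here is a Python sketch of the entropy calculation. It assumes the standard diceware list of 6^5 = 7776 words and that each word is chosen uniformly and independently (five fair dice per word); under those assumptions the min-entropy equals the log of the number of possible passphrases. The function name is hypothetical, and this is not necessarily the calculation used in the thread.

```python
import math

# Assumption: the standard diceware list, indexed by five six-sided dice.
WORDLIST_SIZE = 6**5  # 7776 words


def diceware_entropy_bits(num_words: int, wordlist_size: int = WORDLIST_SIZE) -> float:
    """Min-entropy in bits of a passphrase of uniformly, independently chosen words."""
    return num_words * math.log2(wordlist_size)


print(f"bits per word: {math.log2(WORDLIST_SIZE):.2f}")
print(f"11 words: {diceware_entropy_bits(11):.1f} bits")
```

Each word contributes a little under 13 bits, so 11 words land at roughly 142 bits.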

What policies have most thoroughly crippled (otherwise-promising) industries or technologies?

benwr

In order to seriously consider promoting policies aimed at slowing down progress toward transformative AI, I want a better sense of the reference class of such policies.

What policies do you know of that have "done the most damage" to industry or progress in some restricted domain?
(optional) Exactly what did those policies "accomplish" and how? How would you measure their impact?
(optional) Was the crippling effect intentional on the part of the policymakers?

A Litany Missing from the Canon

benwr

I wish to take whatever action will best bring about what I value.
I wish to become whatever person will most manifest what I value.

There is no virtue in clinging to yesterday’s expectations,
and no vice in hunting what matters.

I can make the best possible choice - and I can do nothing better!
May I seek what is worthy - and nothing else.

I'm interested in concrete ways for humans to evaluate and verify complex facts about the world. I'm especially interested in a set of things that might be described as "bootstrapping trust".

For example:

Say I want to compute some expensive function f on an input x. I have access to a computer C that can compute f; it gives me a result r. But I don't fully trust C - it might be maliciously programmed to tell me a wrong answer. In some cases, I can require that C produce a proof that f(x) = r that I can easily check. In others, I can't. Which cases are which?

A partial answer to this question... (read more)
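One familiar case where checking is much cheaper than computing is finding a nontrivial factor of an integer. The Python sketch below is an illustration chosen for this purpose, not an example taken from the post: `untrusted_find_factor` is a hypothetical stand-in for the computer C, and `check_factor` is the cheap check I can run myself. Here the result r doubles as its own proof.

```python
from typing import Optional


def untrusted_find_factor(n: int) -> Optional[int]:
    """Stand-in for computer C: slow trial-division search for a nontrivial factor."""
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d
        d += 1
    return None  # no nontrivial factor found


def check_factor(n: int, r: Optional[int]) -> bool:
    """Cheap verification: a single modulo operation, regardless of how r was found."""
    return r is not None and 1 < r < n and n % r == 0


n = 91
r = untrusted_find_factor(n)
print(r, check_factor(n, r))
# A wrong answer from C is caught without redoing the search.
print(check_factor(n, 5))
```

Note the asymmetry: a claimed factor is easy to check, but a claim that no factor exists is not verified by this check and would need a different kind of certificate.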

If I got to pick the moral of today's Petrov day incident, it would be something like "being trustworthy requires that you be more difficult to trick than it would be worth", and I think very few people reliably live up to this standard.