All of J Bostock's Comments + Replies

Out of domain (i.e. on a different math benchmark) the RLed model does better at pass@256, especially when using algorithms like RLOO and Reinforce++. If there is a crossover point it is in the thousands. (Figure 7)

This seems critically important. Production models are RLed on hundreds to thousands of benchmarks.

We should also consider that, well, this result just doesn't pass the sniff test given what we've seen RL models do. o3 is a lot better than o1 in a way which suggests that RL budgets do scale heavily with compute, and o3 if anything is better at s... (read more)
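
As a quick illustration of the mechanism being discussed (a minimal toy model with made-up numbers, not anything from the paper): an RL'd model that is very reliable on a narrower set of problems can dominate at pass@1 and still be overtaken by a broader but less reliable base model at large k.

```python
# Toy pass@k comparison (illustrative numbers only, not from the paper).
# Base model: broad but unreliable; RL'd model: reliable on a subset of problems.
p_base = 0.02                    # base model's per-sample success probability on every problem
frac_solvable, p_rl = 0.6, 0.9   # RL'd model: solves 60% of problems reliably, the rest never

for k in [1, 4, 16, 64, 256, 1024]:
    pass_base = 1 - (1 - p_base) ** k
    pass_rl = frac_solvable * (1 - (1 - p_rl) ** k)
    print(f"k={k:4d}  base={pass_base:.3f}  RL={pass_rl:.3f}")
```

With these particular (arbitrary) numbers the crossover happens around k ≈ 45; where it lands in practice depends on how much diversity the RL training sacrifices for reliability.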

1Aaron_Scher
FWIW, I interpret the paper to be making a pretty narrow claim about RL in particular. On the other hand, a lot of the production "RL models" we have seen may not be pure RL. For instance, if you wanted to run a similar test to this paper on DeepSeek-V3+, you would compare DeepSeek-V3 to DeepSeek-R1-Zero (pure RL diff, according to the technical report), not to DeepSeek-R1 (trained with a hard-to-follow mix of SFT and RL). R1-Zero is a worse model than R1, sometimes by a large margin. 
6Thane Ruthenis
If you're referring to the ARC-AGI results, it was just pass@1024, for a nontrivial but not startling jump (75.7% to 87.5%). About the same ballpark as in the paper, plus we don't actually know how much better its pass@1024 was than its pass@256. The costs aren't due to an astronomical k, but due to it writing a 55k-token novel for each attempt plus high $X/million output tokens. (Apparently the revised estimate is $600/million??? It was $60/million initially.) (FrontierMath was pass@1. Though maybe they used consensus@k instead (outputting the most frequent answer out of k, with only one "final answer" passed to the task-specific verifier) or something.)
9Jozdien
o3 may also have a better base model. o3 could be worse at pass@n for high n relative to its base model than o1 is relative to its base model, while still being better than o1. I don't think you need very novel RL algorithms for this either - in the paper, Reinforce++ still does better for pass@256 in all cases. For very high k, pass@k being higher for the base model may just imply that the base model has a broader distribution to sample from, while at lower k the RL'd models benefit from higher reliability. This would imply that it's not a question of how to do RL such that the RL model is always better at any k, but how to trade off reliability for a more diverse distribution (and push the Pareto frontier ahead).
2Thomas Kwa
Agree, I'm pretty confused about this discrepancy. I can't rule out that it's just the "RL can enable emergent capabilities" point.

OK so some further thoughts on this: suppose we instead just partition the values of  directly by something like a clustering algorithm, based on  in  space, and take  just be the cluster that  is in:

Assuming we can do it with small clusters, we know that  is pretty small, so  is also small.

And if we consider , this tells us that learning  restricts us to a pretty small region of  space (since ... (read more)

J Bostock244

Too Early does not preclude Too Late

Thoughts on efforts to shift public (or elite, or political) opinion on AI doom.

Currently, it seems like we're in a state of being Too Early. AI is not yet scary enough to overcome people's biases against AI doom being real. The arguments are too abstract and the conclusions too unpleasant.

Currently, it seems like we're in a state of being Too Late. The incumbent players are already massively powerful and capable of driving opinion through power, politics, and money. Their products are already too useful and ubiquitous t... (read more)

5Seth Herd
I like this framing; we're both too early and too late. But it might transition quite rapidly from too early to right on time. One idea is to prepare strategies and arguments and perhaps prepare the soil of public discourse in preparation for the time when it is no longer too early. Job loss and actually harmful AI shenanigans are very likely before takeover-capable AGI. Preparing for the likely AI scares and negative press might help public opinion shift very rapidly as it sometimes does (e.g., COVID opinions went from no concern to shutting down half the economy very quickly). The average American and probably the average global citizen already dislikes AI. It's just the people benefitting from it that currently like it, and that's a minority. Whether that's enough is questionable, but it makes sense to try and hope that the likely backlash is at least useful in slowing progress or proliferation somewhat.

Under this formulation, FEP is very similar to RL-as-inference. But RL-as-inference is a generalization of a huge number of RL algorithms from Q-learning to LLM fine-tuning. This does kind of make sense if we think of FEP as just a different way of looking at things, but it doesn't really help us narrow down the algorithms that the brain is actually using. Perhaps that's actually all FEP is trying to do though, and Friston has IIRC said things to that effect---that FEP is just a reframing/generalization and not an actual model of the underlying algorithms being employed.

8Yldedly
There are some conceptual differences. In RL, you define a value function for all possible states. In active inference, you make desirable sense data a priori likely. Sensory space is not only lower-dimensional than (unobserved) state space, but you only need to define a single point in it, rather than a function on the whole space. It's often a much more natural way of defining goals and is more similar to control theory than RL. You're directly optimizing for a desired (and known) outcome rather than having to figure out what to optimize for by reinforcement. For example, if you want a robot to walk to some goal point, RL would have to make the robot walk around a bit, figure out that the goal point gives high reward, and then do it (in another rollout). In active inference (and control theory), the robot already knows where the goal point is (or rather, what the world looks like when standing at that point), and merely figures out a sequence of actions that get it there.  Another difference is that active inference automatically balances exploration and exploitation, while in RL it's usually a hyperparameter. In RL, it tends to look like doing many random actions early on, to figure out what gives reward, and later on do actions that keep the agent in high-reward states. In control theory, exploration is more bespoke, and built specifically for system identification (learning a model) or adaptive control (adjusting known parameters based on observations). In active inference, there's no aimless flailing about, but the agent can do any kind of experiment that minimizes future uncertainty - testing what beliefs and actions are likely to achieve the desired sense data. Here's a nice demo of that: 
1Christopher King
Yeah my understanding is that FEP is meant to be quite general, the P and Q are doing a lot of the theory's work for it. Chapter 5 describes how you might apply it to the human brain in particular.

This seems not to be true assuming a P(doom) of 25% and a purely selfish perspective, or even a moderately altruistic perspective which places most of its weight on, say, the person's immediate family and friends.

Of course any cryonics-free strategy is probably dominated by that same strategy plus cryonics for a personal bet at immortality, but when it comes to friends and family it's not easy to convince people to sign up for cryonics! But immortality-maxxing for one's friends and family almost definitely entails accelerating AI even at pretty high P(doom... (read more)

Huh, I had vaguely considered that but I expected any  terms to be counterbalanced by  terms, which together contribute nothing to the KL-divergence. I'll check my intuitions though.

I'm honestly pretty stumped at the moment. The simplest test case I've been using is for  and  to be two flips of a biased coin, where the bias is known to be either  or  with equal probability of either. As  varies, we want to swap from  to the trivial case ... (read more)

I've been working on the reverse direction: chopping up  by clustering the points (treating each distribution as a point in distribution space) given by , optimizing for a deterministic-in- latent  which minimizes .

This definitely separates  and  to some small error, since we can just use  to build a distribution over  which should approximately separate  and .

To show that it's deterministic in  (and by sy... (read more)

7johnswentworth
Sounds like you've correctly understood the problem and are thinking along roughly the right lines. I expect a deterministic function of X won't work, though. Hand-wavily: the problem is that, if we take the latent to be a deterministic function Δ(X), then P[X|Δ(X)] has lots of zeros in it - not approximate-zeros, but true zeros. That will tend to blow up the KL-divergences in the approximation conditions. I'd recommend looking for a function Δ(Λ). Unfortunately that does mean that low entropy of Δ(Λ)given X has to be proven.
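
A minimal numeric illustration of the "true zeros blow up the KL-divergence" point (toy distributions of my own, not anything from the thread):

```python
import numpy as np

def kl(p, q):
    # KL(P || Q) = sum_x P(x) log(P(x)/Q(x)); infinite as soon as Q(x)=0 where P(x)>0.
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p_true = [0.5, 0.5]
print(kl(p_true, [0.5, 0.5]))    # 0.0
print(kl(p_true, [0.99, 0.01]))  # finite (~1.6): approximate zeros only cost a finite amount
print(kl(p_true, [1.0, 0.0]))    # inf: a single true zero makes the divergence blow up
```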

Is the distinction between "elephant + tiny" and "exampledon" primarily about the things the model does downstream? E.g. if none of the fifty dimensions of our subspace represent "has a bright purple spleen" but exampledons do, then the model might need to instead produce a "purple" vector as an output from an MLP whenever "exampledon" and "spleen" are present together.

Just to clarify, do you mean something like "elephant = grey + big + trunk + ears + African + mammal + wise", so to encode a tiny elephant you would have "grey + tiny + trunk + ears + African + mammal + wise", which the model could still read off as 0.86 elephant when relevant, but also tiny when relevant?

2Lucius Bushnaq
'elephant' would be a sum of fifty attribute feature vectors, all with scalar coefficients that match elephants in particular. The coefficients would tend to have sizes on the order of 1/√50, because the subspace is fifty-dimensional. So, if you wanted to have a pure tiny feature and an elephant feature active at the same time to encode a tiny elephant, 'elephant' and 'tiny' would be expected to have read-off interference on the order of 1/√50. Alternatively, you could instead encode a new animal 'tiny elephant' as its own point in the fifty-dimensional space. Those are actually distinct things here. If this is confusing, maybe it helps to imagine that the name for 'tiny elephant' is 'exampledon', and exampledons just happen to look like tiny elephants.
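
A quick numerical check of the 1/√50 claim (my own sketch, using random unit vectors as stand-ins for feature directions in the fifty-dimensional subspace):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_pairs = 50, 10_000

# Random unit vectors in the d-dimensional subspace, standing in for feature directions.
v1 = rng.normal(size=(n_pairs, d))
v2 = rng.normal(size=(n_pairs, d))
v1 /= np.linalg.norm(v1, axis=1, keepdims=True)
v2 /= np.linalg.norm(v2, axis=1, keepdims=True)

interference = np.abs(np.sum(v1 * v2, axis=1))  # read-off interference = |dot product|
print(interference.mean(), 1 / np.sqrt(d))      # both land around 0.11-0.14
```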

I think you should pay in Counterfactual Mugging, and this is one of the newcomblike problem classes that is most common in real life.

Example: you find a wallet on the ground. You can, from least to most pro social:

  1. Take it and steal the money from it
  2. Leave it where it is
  3. Take it and make an effort to return it to its owner

Let's ignore the first option (suppose we're not THAT evil). The universe has randomly selected you today to be in the position where your only options are to spend some resources to no personal gain, or not. In a parallel universe, perhaps... (read more)

I have added a link to the report now.

As to your point: this is one of the better arguments I've heard that welfare ranges might be similar between animals. Still I don't think it squares well with the actual nature of the brain. Saying there's a single suffering computation would make sense if the brain was like a CPU, where one core did the thinking, but actually all of the neurons in the brain are firing at once and doing computations at the same time. So it makes much more sense to me to think that the more neurons are computing some sort of suffering, the greater the intensity of suffering.

3Kaj_Sotala
Can you elaborate how leads to ?
1nielsrolf
One intuition against this is by drawing an analogy to LLMs: the residual stream represents many features. All neurons participate in the representation of a feature. But the difference between a larger and a smaller model is mostly that the larger model can represent more features, not that the larger model represents features with greater magnitude. In humans it seems to be the case that consciousness is most strongly connected to processes in the brain stem, rather than the neocortex. Here is a great talk about the topic - the main points are (writing from memory, might not be entirely accurate):

* humans can lose consciousness or produce intense emotions (good and bad) through interventions on a very small area of the brain stem. When other much larger parts of the brain are damaged or missing, humans continue to behave in a way such that one would ascribe emotions to them from interactions; for example, they show affection.
* dopamine, serotonin, and other chemicals that alter consciousness work in the brain stem

If we consider the question from an evolutionary angle, I'd also argue that emotions are more important when an organism has fewer alternatives (like a large brain that does fancy computations). Once better reasoning skills become available, it makes sense to reduce the impact that emotions have on behavior and instead trust the abstract reasoning. In my own experience, the intensity with which I feel emotions is strongly correlated to how action-guiding it is, and I think as a child I felt emotions more intensely than now, which also fits the hypothesis that more ability to think abstractly reduces the intensity of emotions.

Good point, edited a link to the Google Doc into the post.

J Bostock*3817

From Rethink Priorities:

  1. We used Monte Carlo simulations to estimate, for various sentience models and across eighteen organisms, the distribution of plausible probabilities of sentience.
  2. We used a similar simulation procedure to estimate the distribution of welfare ranges for eleven of these eighteen organisms, taking into account uncertainty in model choice, the presence of proxies relevant to welfare capacity, and the organisms’ probabilities of sentience (equating this probability with the probability of moral patienthood)

Now with the disclaimer that I d... (read more)

1CB
Your disagreement, from what I understand, seems mostly to stem from the fact that shrimp have fewer neurons than humans. Did you check RP's piece on that topic, "Why Neuron Counts Shouldn't Be Used as Proxies for Moral Weight?" https://forum.effectivealtruism.org/posts/Mfq7KxQRvkeLnJvoB/why-neuron-counts-shouldn-t-be-used-as-proxies-for-moral They say this: "In regards to intelligence, we can question both the extent to which more neurons are correlated with intelligence and whether more intelligence in fact predicts greater moral weight; Many ways of arguing that more neurons results in more valenced consciousness seem incompatible with our current understanding of how the brain is likely to work; and There is no straightforward empirical evidence or compelling conceptual arguments indicating that relative differences in neuron counts within or between species reliably predicts welfare relevant functional capacities. Overall, we suggest that neuron counts should not be used as a sole proxy for moral weight, but cannot be dismissed entirely"
5Jeremy Gillen
Can you link to where RP says that?
niplav219

Their epistemics led them to do a Monte Carlo simulation to determine if organisms are capable of suffering (and if so, how much), get a value of 5 shrimp = 1 human, and then not bat an eye at this number.

Neither a physicalist nor a functionalist theory of consciousness can reasonably justify a number like this. Shrimp have 5 orders of magnitude fewer neurons than humans, so whether suffering is the result of a physical process or an information processing one, this implies that shrimp neurons do 4 orders of magnitude more of this process per second

... (read more)
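
Spelling out the arithmetic behind this (round numbers implied by the comment, not precise counts):

```python
human_neurons  = 1e11    # order-of-magnitude figure for a human brain
shrimp_neurons = 1e6     # assumed here: 5 orders of magnitude fewer than a human
welfare_ratio  = 1 / 5   # implied by "5 shrimp = 1 human"

# Suffering per neuron (shrimp relative to human) implied by the two ratios above.
per_neuron = welfare_ratio / (shrimp_neurons / human_neurons)
print(per_neuron)  # 2e4, i.e. roughly 4 orders of magnitude more per neuron
```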

If we approximate an MLP layer with a bilinear layer, then the effect of residual stream features on the MLP output can be expressed as a second order polynomial over the feature coefficients $f_i$. This will contain, for each feature, an $f_i^2 v_i+ f_i w_i$ term, which is "baked into" the residual stream after the MLP acts. Just looking at the linear term, this could be the source of Anthropic's observations of features growing, shrinking, and rotating in their original crosscoder paper. https://transformer-circuits.pub/2024/crosscoders/index.html
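
A minimal sketch of the bilinear-approximation point (toy shapes and weights of my own, not Anthropic's setup): for a bilinear layer g(x) = W_out((W1 x) ⊙ (W2 x)), feeding in a single feature direction with coefficient f produces an output that scales as f², which is the quadratic term above. The linear f_i w_i part would come from cross-terms with biases and other active features, which this toy omits.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden = 32, 64
W1 = rng.normal(size=(d_hidden, d_model)) / np.sqrt(d_model)
W2 = rng.normal(size=(d_hidden, d_model)) / np.sqrt(d_model)
W_out = rng.normal(size=(d_model, d_hidden)) / np.sqrt(d_hidden)

def bilinear_mlp(x):
    # Bilinear stand-in for an MLP: elementwise product of two linear maps, then a readout.
    return W_out @ ((W1 @ x) * (W2 @ x))

v = rng.normal(size=d_model)
v /= np.linalg.norm(v)  # a single residual-stream feature direction

for f in [0.5, 1.0, 2.0]:
    out = bilinear_mlp(f * v)
    print(f, np.linalg.norm(out))  # norm scales as f^2: doubling f quadruples the output
```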

That might be true but I'm not sure it matters. For an AI to learn an abstraction it will have a finite amount of training time, context length, search space width (if we're doing parallel search like with o3) etc. and it's not clear how the abstraction height will scale with those.

Empirically, I think lots of people feel the experience of "hitting a wall" where they can learn abstraction level n-1 easily from class; abstraction level n takes significant study/help; abstraction level n+1 is not achievable for them within reasonable time. So it seems like the time requirement may scale quite rapidly with abstraction level?

J Bostock110

I second this, it could easily be things which we might describe as "amount of information that can be processed at once, including abstractions" which is some combination of residual stream width and context length.

Imagine an AI can do a task that takes 1 hour. To remain coherent over 2 hours, it could either use twice as much working memory, or compress it into a higher level of abstraction. Humans seem to struggle with abstraction in a fairly continuous way (some people get stuck at algebra; some cs students make it all the way to recursion then hit a w... (read more)

8Stephen Fowler
While each mind might have a maximum abstraction height, I am not convinced that the inability of people to deal with increasingly complex topics is direct evidence of this. Is it that this topic is impossible for their mind to comprehend, or is it that they've simply failed to learn it in the finite time period they were given?
J Bostock611

Only partially relevant, but it's exciting to hear a new John/David paper is forthcoming!

J Bostock113

Furthermore: normalizing your data to variance=1 will change your PCA line (if the X and Y variances are different) because the relative importance of X and Y distances will change!
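
A quick demonstration with toy data (my own example): the leading principal component swings from nearly-along-X to the diagonal once each coordinate is rescaled to variance 1.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(scale=10.0, size=1000)            # X has much larger variance than Y
y = 0.1 * x + rng.normal(scale=0.5, size=1000)
data = np.column_stack([x, y])

def first_pc(d):
    d = d - d.mean(axis=0)
    _, _, vt = np.linalg.svd(d, full_matrices=False)
    return vt[0]

print("raw PC1:         ", first_pc(data))                     # ~[0.995, 0.100]: hugs the X axis
print("standardized PC1:", first_pc(data / data.std(axis=0)))  # ~[0.707, 0.707]: the diagonal
```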

J Bostock114

Thanks for writing this up. As someone who was not aware of the eye thing I think it's a good illustration of the level that the Zizians are on, i.e. misunderstanding key important facts about the neurology that is central to their worldview.

My model of double-hemisphere stuff, DID, tulpas, and the like is somewhat null-hypothesis-ish. The strongest version is something like this:

At the upper levels of predictive coding, the brain keeps track of really abstract things about yourself. Think "ego" "self-conception" or "narrative about yourself". This is norm... (read more)

7ChristianKl
It's worth noting that the only evidence we have that this is how unihemispheric sleep gets created comes from Zizian.info, which is critical of Ziz. Slimepriestess claimed in the interview with Ken that the author just made up the exercise independently. When dealing with a complex phenomenon, the idea of "I'll just use the naive null hypothesis" generally does not give you a good understanding of the phenomenon. It's like the theories the Greeks had of how various things work that ignore a lot of the actual phenomena. I think you are wrong if you see self-conception as independent of memories. If you take Steve Andreas's model laid out in Transform Your Self, a self-concept like "I'm a kind person" is inherently built up of memories of remembering yourself as a kind person. With Dissociative Identity Disorder that gets caused by trauma, the traumatic memories might be too much to easily integrate into the existing self-concept, so there's a need for a new personality to house those memories.

This is a very interesting point. I have upvoted this post even though I disagree with it because I think the question of "Who will pay, and how much will they pay, to restrict others' access AI?" is important.

My instinct is that this won't happen, because there are too many AI companies for this deal to work on all of them, and some of these AI companies will have strong kinda-ideological commitments to not doing this. Also, my model of (e.g. OpenAI) is that they want to eat as much of the world's economy as possible, and this is better done by selling (e... (read more)

1purple fire
Hm, this violates my model of the world. Realistically, I think there are like 3-4 labs[1] that matter, OAI, DM, Anthropic, Meta. Even if that was true, they will be at the whim of investors who are almost all big tech companies. This is the explicit claim I was making with the WTP argument. I think this is firmly not true, and OpenAI will make more money by selling just to Oracle. What evidence causes you to disagree? 1. ^ American/Western labs.
J Bostock110

That's part of what I was trying to get at with "dramatic" but I agree now that it might be 80% photogenicity. I do expect that 3000 Americans killed by (a) humanoid robot(s) on camera would cause more outrage than 1 million Americans killed by a virus which we discovered six months later was AI-created in some way.

Previous ballpark numbers I've heard floated around are "100,000 deaths to shut it all down" but I expect the threshold will grow as more money is involved. Depends on how dramatic the deaths are though, 3000 deaths was enough to cause the US to invade two countries back in the 2000s. 100,000 deaths is thirty-three 9/11s.

GeneSmith144

I think the response to 9/11 was an outlier mostly caused by the "photogenic" nature of the disaster. COVID killed over a million Americans yet we basically forgot about it once it was gone. We haven't seen much serious investment in measures to prevent a new pandemic.

Is there a particular reason to not include sex hormones? Some theories suggest that testosterone tracks relative social status. We might expect that high social status -> less stress (of the cortisol type) + more metabolic activity. Since it's used by trans people we have a pretty good idea of what it does to you at high doses (makes you hungry, horny, and angry) but it's unclear whether it actually promotes low cortisol-stress and metabolic activity.

I'm mildly against this being immortalized as part of the 2023 review, though I think it serves excellently as a community announcement for Bay Area rats, which seems to be its original purpose.

I think it has the most long-term-relevant information (about AI and community building) back-loaded and the least relevant information (statistics and details about a no-longer-existent office space in the Bay Area) front-loaded. This is a very Bay Area centric post, which I don't think is ideal.

A better version of this post would be structured as a round-up of the main future-relevant takeaways, with specifics from the office space as examples.

I'm only referring to the reward constraint being satisfied for scenarios that are in the training distribution, since this maths is entirely applied to a decision taking place in training. Therefore I don't think distributional shift applies.

2abramdemski
Ah yep, that's a good clarification.

I haven't actually thought much about particular training algorithms yet. I think I'm working on a higher level of abstraction than that at the moment, since my maths doesn't depend on any specifics about V's behaviour. I do expect that in practice an already-scheming V would be able to escape some finite-time reasonable-beta-difference situations like this, with partial success.

I'm also imagining that during training, V is made up of different circuits which might be reinforced or weakened.

My view is that, if V is shaped by a training process like this, t... (read more)

I was arguing that if your assumptions are obeyed only approximately, then the argument breaks down quickly.

All arguments break down a bit when introduced to the real world. Is there a particular reason why the approximation error to argument breakdown ratio should be particularly high in this case? 

Example, if we introduce some error to the beta-coherence assumption:

Assume beta_t = 1, beta_s = 0.5, r_1 = 1, r_2 = 0.

V(s_0) = e/(1+e) +/- delta = 0.731 +/- delta

Actual expected value = 0.622

Even if |delta| = 0.1 the system cannot be coherent over training in this case. This seems to be relatively robust to me.
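
A quick check of these numbers (assuming, as in the example above, that beta-coherence means V(s_0) equals the softmax-weighted average of the successor values, here just the terminal rewards r_1 = 1 and r_2 = 0):

```python
import numpy as np

def softmax_value(values, beta):
    # Value of a state whose actions are sampled with softmax weights exp(beta * value).
    values = np.asarray(values, float)
    p = np.exp(beta * values)
    p /= p.sum()
    return float(p @ values)

r = [1.0, 0.0]
print(softmax_value(r, beta=1.0))  # 0.731... = e/(1+e): what a beta_t-coherent V must assign
print(softmax_value(r, beta=0.5))  # 0.622...: the actual expectation when sampling at beta_s
```

The gap between the two is about 0.109, which is where the "even |delta| = 0.1 is not enough" comparison comes from.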

2abramdemski
Yeah, of course the notion of "approximation error" matters a lot here. I was imagining trying to implement the hypothetical pre-training. An obvious implementation strategy is to do regular RL training with βt until V is approximately coherent. That's the pre-training. And then you switch to training with βs.[1] The natural notion of approximation error in this case is the one you can easily measure, IE, the expected distance from coherence of states sampled with temperature βt. This reflects the fact that it'll be hard for this pre-training step to make states individually obey the coherence constraint when those states are very rarely or never sampled at temperature βt, but easy for frequently-sampled states. My argument, then, is that it is easy in principle for V to distinguish whether it is in pre-training or regular training when it encounters states which have a significant difference in probability when we're sampling according to βt or βs. A scheming V can utilize this to self-preserve. This violates the assumption of βt-coherence, but in a very plausible-seeming way. 1. ^ My earlier comment about this mistakenly used β1 and β2 in place of βt and βs, which may have been confusing. I'll go fix that to be consistent with your notation.

This generalizes to an argument that the method is very sensitive to imperfections in the beta-coherence. If the V starts out merely approximately beta-coherent, this leaves room for V to detect when a state is off-distribution (IE has very low probability under the previous beta), and behave differently in this new case (reward-hacking the new beta).

I agree that there are some exceedingly pathological Vs which could survive a a process which obeys my assumptions with high probability, but I don't think that's relevant because I still think a process obeyi... (read more)

4abramdemski
To be clear, that's not the argument I was trying to make; I was arguing that if your assumptions are obeyed only approximately, then the argument breaks down quickly.

Trained with what procedure, exactly?

Fair point. I was going to add that I don't really view this as a "proposal" but more of an observation. We will have to imagine a procedure which converges on correctness and beta-coherence. I was abstracting this away because I don't expect something like this to be too hard to achieve.

Since I've evidently done a bad job of explaining myself, I'll backtrack and try again:

There's a doom argument which I'll summarize as "if your training process generates coherent agents which succeed at a task, one solution is that ... (read more)

4abramdemski
With you so far. OK. I think I follow the line of reasoning now. The proposal relies on some very special pre-training process. I think I buy the conclusion given the assumptions. It seems clear in the extreme high-temperature case: V cannot reward-hack for max-temperature training, because it has no control over which actions are chosen. Training eventually forces V to converge to the desired target instead. This sounds similar to the two-stage training process I mentioned earlier (as one of several guesses at what you meant), where the beta is shifted between the two training phases. The "hypothetical process to generate agents which are coherent under one beta" would be the first training phase, and then the "apply a different beta during training" would be the second training phase. Of course, the two-stage-training attempt to implement this has the problem that it only achieves approximate coherence. As a result, the malign hypothesis can use the reward-hacking strategy "appear to be βt-coherent & aligned in the first phase of training; appear to be βs-coherent and aligned in the second phase of training; then, do some other thing when deployed". This relies on detecting the distributional shift between the two training phases (it can look for evidence of beta by examining the history), and also detecting distributional shift to deployment. This generalizes to an argument that the method is very sensitive to imperfections in the beta-coherence. If the V starts out merely approximately beta-coherent, this leaves room for V to detect when a state is off-distribution (IE has very low probability under the previous beta), and behave differently in this new case (reward-hacking the new beta). So, while it remains technically true that V cannot reward-hack in the limit of arbitrarily long max-temperature training, it is not clear how well this holds up in practice. V can detect that it is in a high-temperature configuration by observing the state. The training n

The argument could also be phrased as "If an AI is trained to be coherent wrt a high beta, it cannot also be coherent wrt a low beta. Therefore an AI trained to a high beta cannot act coherently over multiple independent RL episodes if sampled with a low beta."

The contradiction that I (attempt to) show only arises because we assume that the value function is totally agnostic of the state actually reached during training, other than due to its effects on a later deployed AI.

Therefore a value function trained with such a procedure must consider the state rea... (read more)

2abramdemski
Is the idea to train with high beta and then use lower beta post-training?

* If so, how does this relate to reward hacking and value preservation? IE, where do V1 and V2 come from, if they aren't the result of a further training step? If high beta is used during training (to achieve beta-coherence) and then low beta is used in production, then the choice between V1 and V2 must be made in production (since it is made with low beta), but then it seems like V1=V2.
* If not, then when does the proposal suggest to use high beta vs low beta? If low beta is used during training, then how is it that V is coherent with respect to high beta instead?

Another concern I have is that if both beta values are within a range that can yield useful capabilities, it seems like the difference cannot be too great. IIUC, the planning failure postulated can only manifest if the reward-hacking relies heavily on a long string of near-optimal actions, which becomes improbable under increased temperature. Any capabilities which similarly rely on long strings of near-optimal actions will similarly be hurt. (However, this concern is secondary to my main confusion.)

Trained with what procedure, exactly?

(These parts made sense to me modulo my other questions/concerns/confusions.)

I think you're right, correctness and beta-coherence can be rolled up into one specific property. I think I wrote down correctness as a constraint first, then tried to add coherence, but the specific property is that:

For non-terminal s, this can be written as:

If s is terminal then [...] we just have .

Which captures both. I will edit the post to clarify this when I get time.

2abramdemski
If the probability of eventually encountering a terminal state is 1, then beta-coherence alone is inconsistent with deceptive misalignment, right? That's because we can determine the value of V exactly from the reward function and the oracle, via backwards-induction. (I haven't revisited RL convergence theorems in a while, I suspect I am not stating this quite right.) I mean, it is still consistent in the case where r is indifferent to the states encountered during training but wants some things in deployment (IE, r is inherently consistent with the provided definition of "deceptively misaligned"). However, it would be inconsistent for r that are not like that. In other words: you cannot have inner-alignment problems if the outer objective is perfectly imposed. You can only have inner-alignment problems if there are important cases which your training procedure wasn't able to check (eg, due to distributional shift, or scarcity of data). Perfect beta-coherence combined with a perfect oracle O rules this out.
2Joseph Miller
Is that rolling up two things into one, or is that just beta-coherence?

I somehow missed that they had a discord! I couldn't find anything on mRNA on their front-facing website, and since it hasn't been updated in a while I assumed they were relatively inactive. Thanks! 

Thinking back to the various rationalist attempts to make a vaccine (https://www.lesswrong.com/posts/niQ3heWwF6SydhS7R/making-vaccine), for bird-flu-related reasons. Since then, we've seen mRNA vaccines arise as a new vaccination method. mRNA vaccines have been used intranasally for COVID with success in hamsters. If one can order mRNA for a flu protein, it would only take mixing that with some sort of delivery mechanism (such as Lipofectamine, which is commercially available) and snorting it to get what could actually be a pretty good vaccine. Has RaDVac or similar looked at this?

  1. Thanks for catching the typo.
  2. Epistemic status has been updated to clarify that this is satirical in nature.
2noggin-scratcher
Oh I was very on board with the sarcasm. Although as a graduate of one of them, I obviously can't believe you're rating the other one so highly.
J Bostock10-3

I don't think it was unforced

You're right, "unforced" was too strong a word, especially given that I immediately followed it with caveats gesturing to potential reasonable justifications.

Yes, I think the bigger issue is the lack of top-down coordination on the comms pipeline. This paper does a fine job of being part of a research -> research loop. Where it fails is in being good for comms. Starting with a "good" model and trying (and failing) to make it "evil" means that anyone using the paper for comms has to introduce a layer of abstraction into their... (read more)

J Bostock*5627

Edited for clarity based on some feedback, without changing the core points

To start with an extremely specific example that I nonetheless think might be a microcosm of a bigger issue: the "Alignment Faking in Large Language Models" paper contained a very large unforced error: namely that you started with Helpful-Harmless-Claude and tried to train out the harmlessness, rather than starting with Helpful-Claude and training in harmlessness. This made the optics of the paper much more confusing than it needed to be, leading to lots of people calling it "good news". ... (read more)

contained a very large unforced error

It's possible this was a mistake and we should have more aggressively tried to explore versions of the setting where the AI starts off more "evil", but I don't think it was unforced. We thought about this a bunch and considered if there were worthwhile things here.

Edit: regardless, I don't think this example is plausibly a microcosm of a bigger issue as this choice was mostly made by individual researchers without much top down influence. (Unless your claim is that there should have been more top down influence.)

9davekasten
FWIW re: the Dario 2025 comment, Anthropic very recently posted a few job openings for recruiters focused on policy and comms specifically, which I assume is a leading indicator for hiring. One plausible rationale there is that someone on the executive team smashed the "we need more people working on this, make it happen" button.
J Bostock314

Since I'm actually in that picture (I am the one with the hammer) I feel an urge to respond to this post. The following is not the entire endorsed and edited worldview/theory of change of Pause AI, it's my own views. It may also not be as well thought-out as it could be.

Why do you think "activists have an aura of evil about them"? In the UK, where I'm based, we usually see a large march/protest/demonstration every week. Most of the time, the people who agree with the activists are vaguely positive and the people who disagree with the activists are vaguely n... (read more)

4Eneasz
My answer is going to be unsatisfying - entirely vibes. While there are still significant sections of the populace that have left-over affection for anything that looks like the Civil Rights movement due to how valorized that movement is and how much change it effected, this is seriously waning. The non-effectiveness of movements that just copy the aesthetics is slowly making them look more like cargo cults that copy the form but without an understanding of the substance that made them successful. As more people dismiss protestors as performance without substance, protests start getting more awful to get anyone's attention. Destroying social value and public goods for a cause no one else cares about grows increasingly irksome. When major lawlessness threatens people and sets fire to city blocks in the name of activism, the good will drains away pretty rapidly. Now the cargo cults are just destroying stuff without any path to how that's supposed to make things better. It's an ongoing change. We're only seeing the start of it. But IMO it's pretty undeniable that a decent percentage of the population thinks of activists as default harmful, and a preference cascade is just over the horizon.
J Bostock131

This, more than the original paper, or the recent Anthropic paper, is the most convincingly-worrying example of AI scheming/deception I've seen. This will be my new go-to example in most discussions. This comes from first considering a model property which is both deeply and shallowly worrying, then robustly eliciting it, and finally ruling out alternative hypotheses.

J Bostock15-26

I think it's very unlikely that a mirror bacterium would be a threat. <1% chance of a mirror-clone being a meaningfully more serious threat to humans as a pathogen than the base bacterium. The adaptive immune system just isn't chirally dependent. Antibodies are selected as needed from a huge library, and you can get antibodies to loads of unnatural things (PEG, chlorinated benzenes, etc.). They trigger attack mechanisms like MAC which attacks membranes in a similarly independent way.

In fact, mirror amino acids are already somewhat common in nature! Bacteria... (read more)

3P. João
It seems you’ve considered a lot of interesting variables, which would likely lower the overall probability.
8dr_s
The antibodies themselves not being chirally dependent doesn't mean that other fundamental links in the chain that leads to antibodies being deployed at all aren't chirality-dependent. Mostly I imagine the risk is that we have a lot of systems optimized for dealing with life of a certain chirality. They may be able to cope with the opposite chirality, but less so. COVID alone showed what happens when something far less alien, but just barely out of distribution for our current immune defenses, arrives: literally everyone in the world gets it in a matter of months, and a non-insignificant percentage dies even if the pathogen itself is no more complex or virulent than others we deal with on the daily. And COVID was easy mode. We have examples of far more apocalyptic outcomes from immune-naive populations getting in contact with new pathogens. Here we're not even talking about somehow innocuous entities. E. coli can and will kill you if it gets in the wrong place while your defenses are down, no mirroring necessary. Staph. aureus is everywhere already and will eat your flesh while you still live if given the chance. The only reason why we coexist with these threats is that we are in an armed truce: they can stay within their turf, but as soon as they try and go where they don't belong, they get terminated with maximum prejudice. Immuno-compromised people have to fear them a lot more. Imagining a version of them that is both antibiotic-resistant (because I bet that's also a consequence of chirality) and able to evade at least the first few layers of immune defenses, until somehow the system scrambles to compensate and manages to churn out a counter-measure, is terrifying enough. That the immune system may eventually cope with them doesn't mean it wouldn't be an apocalyptic pandemic (and worse, one that affects man and animal alike, all at once).

Yes, antibodies could adapt to mirror pathogens. The concern is that the system which generates antibodies wouldn't be strongly triggered. The Science article says: “For example, experiments show that mirror proteins resist cleavage into peptides for antigen presentation and do not reliably trigger important adaptive immune responses such as the production of antibodies (11, 12).”

8DirectedEvolution
Acquired immune systems (antibodies, T cells) are restricted to jawed vertebrates.
5cdt
I think this is the crux of the different feelings around this paper. There are a lot of unknowns here. The paper does a good job of acknowledging this and (imo) it justifies a precautionary approach, but I think the breadth of uncertainty is difficult to communicate in e.g. policy briefs or newspaper articles.

I think the risk of infection to humans would be very low. The human body can generate antibodies to pretty much anything (including PEG, benzenes, which never appear in nature) by selecting protein sequences from a huge library of cells. This would activate the complement system which targets membranes and kills bacteria in a non-chiral way.

The risk to invertebrates and plants might be more significant, not sure about the specifics of plant immune system.

J Bostock242

So Sonnet 3.6 can almost certainly speed up some quite obscure areas of biotech research. Over the past hour I've got it to:

  1. Estimate a rate, correct itself (although I did have to clock that its result was likely off by some OOMs, which turned out to be 7-8), request the right info, and then get a more reasonable answer.
  2. Come up with a better approach to a particular thing than I was able to, which I suspect has a meaningfully higher chance of working than what I was going to come up with.

Perhaps more importantly, it required almost no mental effort on my ... (read more)

1Qumeric
I think you might find this paper relevant/interesting: https://aidantr.github.io/files/AI_innovation.pdf TL;DR: Research on LLM productivity impacts in materials discovery. Main takeaways:

* Significant productivity improvement overall
* Mostly at the idea generation phase
* Top performers benefit much more (because they can evaluate the AI's ideas well)
* Mild decrease in job satisfaction (AI automates the most interesting parts, impact partly counterbalanced by improved productivity)

In practice, sadly, developing a true ELM is currently too expensive for us to pursue (but if you want to fund us to do that, lmk). So instead, in our internal research, we focus on finetuning over pretraining. Our goal is to be able to teach a model a set of facts/constraints/instructions and be able to predict how it will generalize from them, and ensure it doesn’t learn unwanted facts (such as learning human psychology from programmer comments, or general hallucinations).

 

This has reminded me to revisit some work I was doing a couple of months ago ... (read more)

Shrimp have ultra tiny brains, with less than 0.1% of human neurons.

Humans have 1e11 neurons; what's the source for the shrimp neuron count? The closest I can find is lobsters having 1e5 neurons, and crabs having 1e6 (all from Google AI overview), which is a factor of much more than 1,000.
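
For concreteness, using the lobster figure as a stand-in:

```python
human, lobster = 1e11, 1e5
print(lobster / human)  # 1e-06, i.e. 0.0001% of human neurons -- a factor of ~a million, not ~a thousand
```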

3Arturo Macias
This is the kind of criticism I kindly welcome. I used the cockroach data (forebrain) here as a Proxy: https://en.m.wikipedia.org/wiki/List_of_animals_by_number_of_neurons#:~:text=The human brain contains 86,neurons in the cerebral cortex.

I volunteer to play Minecraft with the LLM agents. I think this might be one eval where the human evaluators are easy to come by.

1Yonatan Cale
:)   If you want to try it meanwhile, check out https://github.com/MineDojo/Voyager

Ok: I'll operationalize this as the ratio of first choices for the first group (Stop/PauseAI) to first choices for projects in the third and fourth groups (mech interp, agent foundations), for the periods 12th-13th vs 15th-16th. I'll discount the final day since the final-day spike is probably confounding.

4Linda Linsefors
12th-13th
* 18 total applications
* 2 (11%) Stop/Pause AI
* 7 (39%) Mech-Interp and Agent Foundations

15th-16th
* 45 total applications
* 4 (9%) Stop/Pause AI
* 20 (44%) Mech-Interp and Agent Foundations

All applications
* 370 total
* 33 (12%) Stop/Pause AI
* 123 (46%) Mech-Interp and Agent Foundations

Looking at the above data, it is directionally correct for your hypothesis, but it doesn't look statistically significant to me. The numbers are pretty small, so it could be a fluke. So I decided to add some more data:

10th-11th
* 20 total applications
* 4 (20%) Stop/Pause AI
* 8 (40%) Mech-Interp and Agent Foundations

Looking at all of it, it looks like Stop/Pause AI applications are coming in at a stable rate, while Mech-Interp and Agent Foundations are going up a lot after the 14th.
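
A rough way to check the "could be a fluke" intuition on the 12th-13th vs 15th-16th counts (a sketch assuming scipy is available, and treating first choices as independent draws):

```python
from scipy.stats import fisher_exact

#        Stop/Pause AI   other first choices
table = [[2, 18 - 2],    # 12th-13th
         [4, 45 - 4]]    # 15th-16th
odds_ratio, p_value = fisher_exact(table)
print(odds_ratio, p_value)  # p-value well above 0.05: the difference in rates is not significant
```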

It might be the case that AISC was extra late-skewed because the MATS rejection letters went out on the 14th (guess how I know) so I think a lot of people got those and then rushed to finish their AISC applications (guess why I think this) before the 17th. This would predict that the ratio of technical:less-technical applications would increase in the final few days.

2Linda Linsefors
Sounds plausible.

> This would predict that the ratio of technical:less-technical applications would increase in the final few days.

If you want to operationalise this in terms of project first choice, I can check.

For a good few years you'd have a tiny baby limb, which would make it impossible to have a normal prosthetic. I also think most people just don't want a tiny baby limb attached to them. I don't think growing it in the lab for a decade is feasible for a variety of reasons. I also don't know how they planned to wire the nervous system in, or ensure the bone sockets attach properly, or connect the right blood vessels. The challenge is just immense, and it gets less and less worth it over time as trauma surgery and prosthetics improve.

The regrowing limb thing is a nonstarter due to the issue of time, if I understand correctly. Salamanders that can regrow limbs take roughly the same amount of time to regrow them as the limb takes to grow in the first place. So it would be 1-2 decades before the limb was of adult size. Secondly, it's not as simple as just smearing some stem cells onto an arm stump. Limbs form because of specific signalling molecules in specific gradients. I don't think these are present in an adult body once the limb is made. So you'd need a socket which produces those, which you'd have to build in the lab, attach to a blood supply to feed the limb, etc.

0Purplehermann
The first issue seems minor - even if true, a 40 year old man could have a new arm by 60

My model: suppose we have a DeepDreamer-style architecture, where (given a history of sensory inputs) the babbler module produces a distribution over actions, a world model predicts subsequent sensory inputs, and an evaluator predicts expected future X. If we run a tree-search over some weighted combination of the X, Y, and Z maximizers' predicted actions, then run each of the X, Y, and Z maximizers' evaluators, we'd get a reasonable approximation of a weighted maximizer.

This wouldn't be true if we gave negative weights to the maximizers, because while th... (read more)
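
A minimal sketch of the construction described above (interfaces and names are hypothetical, not any particular implementation): pool candidate actions from each component maximizer's babbler, roll them through a shared world model, and score them with the weighted sum of the component evaluators.

```python
from dataclasses import dataclass
from typing import Any, Callable, List

@dataclass
class Maximizer:
    babble: Callable[[Any], List[Any]]   # state -> candidate actions (heuristic proposals)
    evaluate: Callable[[Any], float]     # state -> predicted future value of its own objective

def weighted_step(state: Any,
                  maximizers: List[Maximizer],
                  weights: List[float],
                  world_model: Callable[[Any, Any], Any],
                  top_k: int = 3) -> List[Any]:
    # Pool the actions proposed by every component maximizer's babbler.
    candidates = [a for m in maximizers for a in m.babble(state)]

    # Score each candidate by the weighted sum of the component evaluators,
    # applied to the world model's predicted next state.
    def score(action):
        next_state = world_model(state, action)
        return sum(w * m.evaluate(next_state) for w, m in zip(weights, maximizers))

    return sorted(candidates, key=score, reverse=True)[:top_k]
```

Note that flipping the sign of a weight only flips the evaluator term; the babblers still only propose actions that looked promising for the original objectives, which is the asymmetry the surrounding comments point at.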

Seems like if you're working with neural networks there's not a simple map from an efficient (in terms of program size, working memory, and speed) optimizer which maximizes X to an equivalent optimizer which maximizes -X. If we consider that an efficient optimizer does something like tree search, then it would be easy to flip the sign of the node-evaluating "prune" module. But the "babble" module is likely to select promising actions based on a big bag of heuristics which aren't easily flipped. Moreover, flipping a heuristic which upweights a small subset ... (read more)

2JBlack
How do you construct a maximizer for 0.3X+0.6Y+0.1Z from three maximizers for X, Y, and Z? It certainly isn't true in general for black box optimizers, so presumably this is something specific to a certain class of neural networks.
5habryka
True if you don't count the training process as part of the optimizer (which is a choice that sometimes makes sense and sometimes doesn't). If you count the training process as part of the optimizer, then you can of course just flip your loss function or RL signal most of the time.