All of J Bostock's Comments + Replies

Out of domain (i.e. on a different math benchmark) the RLed model does better at pass@256, especially when using algorithms like RLOO and Reinforce++. If there is a crossover point it is in the thousands. (Figure 7)

This seems critically important. Production models are RLed on hundreds to thousands of benchmarks.

We should also consider that, well, this result just doesn't pass the sniff test given what we've seen RL models do. o3 is a lot better than o1 in a way which suggests that RL budgets do scale heavily with compute, and o3 if anything is better at s... (read more)
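
As a quick illustration of the mechanism being discussed (a minimal toy model with made-up numbers, not anything from the paper): an RL'd model that is very reliable on a narrower set of problems can dominate at pass@1 and still be overtaken by a broader but less reliable base model at large k.

```python
# Toy pass@k comparison (illustrative numbers only, not from the paper).
# Base model: broad but unreliable; RL'd model: reliable on a subset of problems.
p_base = 0.02                    # base model's per-sample success probability on every problem
frac_solvable, p_rl = 0.6, 0.9   # RL'd model: solves 60% of problems reliably, the rest never

for k in [1, 4, 16, 64, 256, 1024]:
    pass_base = 1 - (1 - p_base) ** k
    pass_rl = frac_solvable * (1 - (1 - p_rl) ** k)
    print(f"k={k:4d}  base={pass_base:.3f}  RL={pass_rl:.3f}")
```

With these particular (arbitrary) numbers the crossover happens around k ≈ 45; where it lands in practice depends on how much diversity the RL training sacrifices for reliability.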

1Aaron_Scher
FWIW, I interpret the paper to be making a pretty narrow claim about RL in particular. On the other hand, a lot of the production "RL models" we have seen may not be pure RL. For instance, if you wanted to run a similar test to this paper on DeepSeek-V3+, you would compare DeepSeek-V3 to DeepSeek-R1-Zero (pure RL diff, according to the technical report), not to DeepSeek-R1 (trained with a hard-to-follow mix of SFT and RL). R1-Zero is a worse model than R1, sometimes by a large margin. 
6Thane Ruthenis
If you're referring to the ARC-AGI results, it was just pass@1024, for a nontrivial but not startling jump (75.7% to 87.5%). About the same ballpark as in the paper, plus we don't actually know how much better its pass@1024 was than its pass@256. The costs aren't due to an astronomical k, but due to it writing a 55k-token novel for each attempt plus high $X/million output tokens. (Apparently the revised estimate is $600/million??? It was $60/million initially.) (FrontierMath was pass@1. Though maybe they used consensus@k instead (outputting the most frequent answer out of k, with only one "final answer" passed to the task-specific verifier) or something.)
9Jozdien
o3 may also have a better base model. o3 could be worse at pass@n for high n relative to its base model than o1 is relative to its base model, while still being better than o1. I don't think you need very novel RL algorithms for this either - in the paper, Reinforce++ still does better for pass@256 in all cases. For very high k, pass@k being higher for the base model may just imply that the base model has a broader distribution to sample from, while at lower k the RL'd models benefit from higher reliability. This would imply that it's not a question of how to do RL such that the RL model is always better at any k, but how to trade off reliability for a more diverse distribution (and push the Pareto frontier ahead).
2Thomas Kwa
Agree, I'm pretty confused about this discrepancy. I can't rule out that it's just the "RL can enable emergent capabilities" point.

OK so some further thoughts on this: suppose we instead just partition the values of  directly by something like a clustering algorithm, based on  in  space, and take  just be the cluster that  is in:

Assuming we can do it with small clusters, we know that  is pretty small, so  is also small.

And if we consider , this tells us that learning  restricts us to a pretty small region of  space (since ... (read more)

J Bostock244

Too Early does not preclude Too Late

Thoughts on efforts to shift public (or elite, or political) opinion on AI doom.

Currently, it seems like we're in a state of being Too Early. AI is not yet scary enough to overcome people's biases against AI doom being real. The arguments are too abstract and the conclusions too unpleasant.

Currently, it seems like we're in a state of being Too Late. The incumbent players are already massively powerful and capable of driving opinion through power, politics, and money. Their products are already too useful and ubiquitous t... (read more)

5Seth Herd
I like this framing; we're both too early and too late. But it might transition quite rapidly from too early to right on time. One idea is to prepare strategies and arguments and perhaps prepare the soil of public discourse in preparation for the time when it is no longer too early. Job loss and actually harmful AI shenanigans are very likely before takeover-capable AGI. Preparing for the likely AI scares and negative press might help public opinion shift very rapidly as it sometimes does (e.g., COVID opinions went from no concern to shutting down half the economy very quickly). The average American and probably the average global citizen already dislikes AI. It's just the people benefitting from it that currently like it, and that's a minority. Whether that's enough is questionable, but it makes sense to try and hope that the likely backlash is at least useful in slowing progress or proliferation somewhat.

Under this formulation, FEP is very similar to RL-as-inference. But RL-as-inference is a generalization of a huge number of RL algorithms from Q-learning to LLM fine-tuning. This does kind of make sense if we think of FEP as just a different way of looking at things, but it doesn't really help us narrow down the algorithms that the brain is actually using. Perhaps that's actually all FEP is trying to do though, and Friston has IIRC said things to that effect---that FEP is just a reframing/generalization and not an actual model of the underlying algorithms being employed.

8Yldedly
There are some conceptual differences. In RL, you define a value function for all possible states. In active inference, you make desirable sense data a priori likely. Sensory space is not only lower-dimensional than (unobserved) state space, but you only need to define a single point in it, rather than a function on the whole space. It's often a much more natural way of defining goals and is more similar to control theory than RL. You're directly optimizing for a desired (and known) outcome rather than having to figure out what to optimize for by reinforcement. For example, if you want a robot to walk to some goal point, RL would have to make the robot walk around a bit, figure out that the goal point gives high reward, and then do it (in another rollout). In active inference (and control theory), the robot already knows where the goal point is (or rather, what the world looks like when standing at that point), and merely figures out a sequence of actions that get it there.  Another difference is that active inference automatically balances exploration and exploitation, while in RL it's usually a hyperparameter. In RL, it tends to look like doing many random actions early on, to figure out what gives reward, and later on do actions that keep the agent in high-reward states. In control theory, exploration is more bespoke, and built specifically for system identification (learning a model) or adaptive control (adjusting known parameters based on observations). In active inference, there's no aimless flailing about, but the agent can do any kind of experiment that minimizes future uncertainty - testing what beliefs and actions are likely to achieve the desired sense data. Here's a nice demo of that: 
1Christopher King
Yeah my understanding is that FEP is meant to be quite general, the P and Q are doing a lot of the theory's work for it. Chapter 5 describes how you might apply it to the human brain in particular.

This seems not to be true assuming a P(doom) of 25% and a purely selfish perspective, or even a moderately altruistic perspective which places most of its weight on, say, the person's immediate family and friends.

Of course any cryonics-free strategy is probably dominated by that same strategy plus cryonics for a personal bet at immortality, but when it comes to friends and family it's not easy to convince people to sign up for cryonics! But immortality-maxxing for one's friends and family almost definitely entails accelerating AI even at pretty high P(doom... (read more)

Huh, I had vaguely considered that but I expected any  terms to be counterbalanced by  terms, which together contribute nothing to the KL-divergence. I'll check my intuitions though.

I'm honestly pretty stumped at the moment. The simplest test case I've been using is for  and  to be two flips of a biased coin, where the bias is known to be either  or  with equal probability of either. As  varies, we want to swap from  to the trivial case ... (read more)

I've been working on the reverse direction: chopping up  by clustering the points (treating each distribution as a point in distribution space) given by , optimizing for a deterministic-in- latent  which minimizes .

This definitely separates  and  to some small error, since we can just use  to build a distribution over  which should approximately separate  and .

To show that it's deterministic in  (and by sy... (read more)

7johnswentworth
Sounds like you've correctly understood the problem and are thinking along roughly the right lines. I expect a deterministic function of X won't work, though. Hand-wavily: the problem is that, if we take the latent to be a deterministic function Δ(X), then P[X|Δ(X)] has lots of zeros in it - not approximate-zeros, but true zeros. That will tend to blow up the KL-divergences in the approximation conditions. I'd recommend looking for a function Δ(Λ). Unfortunately that does mean that low entropy of Δ(Λ)given X has to be proven.
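
A minimal numeric illustration of the "true zeros blow up the KL-divergence" point (toy distributions of my own, not anything from the thread):

```python
import numpy as np

def kl(p, q):
    # KL(P || Q) = sum_x P(x) log(P(x)/Q(x)); infinite as soon as Q(x)=0 where P(x)>0.
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p_true = [0.5, 0.5]
print(kl(p_true, [0.5, 0.5]))    # 0.0
print(kl(p_true, [0.99, 0.01]))  # finite (~1.6): approximate zeros only cost a finite amount
print(kl(p_true, [1.0, 0.0]))    # inf: a single true zero makes the divergence blow up
```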

Is the distinction between "elephant + tiny" and "exampledon" primarily about the things the model does downstream? E.g. if none of the fifty dimensions of our subspace represent "has a bright purple spleen" but exampledons do, then the model might need to instead produce a "purple" vector as an output from an MLP whenever "exampledon" and "spleen" are present together.

Just to clarify, do you mean something like "elephant = grey + big + trunk + ears + African + mammal + wise", so to encode a tiny elephant you would have "grey + tiny + trunk + ears + African + mammal + wise", which the model could still read off as 0.86 elephant when relevant, but also tiny when relevant?

2Lucius Bushnaq
'elephant' would be a sum of fifty attribute feature vectors, all with scalar coefficients that match elephants in particular. The coefficients would tend to have sizes on the order of 1/√50, because the subspace is fifty-dimensional. So, if you wanted to have a pure tiny feature and an elephant feature active at the same time to encode a tiny elephant, 'elephant' and 'tiny' would be expected to have read-off interference on the order of 1/√50. Alternatively, you could instead encode a new animal 'tiny elephant' as its own point in the fifty-dimensional space. Those are actually distinct things here. If this is confusing, maybe it helps to imagine that the name for 'tiny elephant' is 'exampledon', and exampledons just happen to look like tiny elephants.
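
A quick numerical check of the 1/√50 claim (my own sketch, using random unit vectors as stand-ins for feature directions in the fifty-dimensional subspace):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_pairs = 50, 10_000

# Random unit vectors in the d-dimensional subspace, standing in for feature directions.
v1 = rng.normal(size=(n_pairs, d))
v2 = rng.normal(size=(n_pairs, d))
v1 /= np.linalg.norm(v1, axis=1, keepdims=True)
v2 /= np.linalg.norm(v2, axis=1, keepdims=True)

interference = np.abs(np.sum(v1 * v2, axis=1))  # read-off interference = |dot product|
print(interference.mean(), 1 / np.sqrt(d))      # both land around 0.11-0.14
```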

I think you should pay in Counterfactual Mugging, and this is one of the newcomblike problem classes that is most common in real life.

Example: you find a wallet on the ground. You can, from least to most pro social:

  1. Take it and steal the money from it
  2. Leave it where it is
  3. Take it and make an effort to return it to its owner

Let's ignore the first option (suppose we're not THAT evil). The universe has randomly selected you today to be in the position where your only options are to spend some resources to no personal gain, or not. In a parallel universe, perhaps... (read more)

I have added a link to the report now.

As to your point: this is one of the better arguments I've heard that welfare ranges might be similar between animals. Still I don't think it squares well with the actual nature of the brain. Saying there's a single suffering computation would make sense if the brain was like a CPU, where one core did the thinking, but actually all of the neurons in the brain are firing at once and doing computations at the same time. So it makes much more sense to me to think that the more neurons are computing some sort of suffering, the greater the intensity of suffering.

3Kaj_Sotala
Can you elaborate how leads to ?
1nielsrolf
One intuition against this is by drawing an analogy to LLMs: the residual stream represents many features. All neurons participate in the representation of a feature. But the difference between a larger and a smaller model is mostly that the larger model can represent more features, not that the larger model represents features with greater magnitude. In humans it seems to be the case that consciousness is most strongly connected to processes in the brain stem, rather than the neocortex. Here is a great talk about the topic - the main points are (writing from memory, might not be entirely accurate):

* humans can lose consciousness or produce intense emotions (good and bad) through interventions on a very small area of the brain stem. When other much larger parts of the brain are damaged or missing, humans continue to behave in a way such that one would ascribe emotions to them from interactions; for example, they show affection.
* dopamine, serotonin, and other chemicals that alter consciousness work in the brain stem

If we consider the question from an evolutionary angle, I'd also argue that emotions are more important when an organism has fewer alternatives (like a large brain that does fancy computations). Once better reasoning skills become available, it makes sense to reduce the impact that emotions have on behavior and instead trust the abstract reasoning. In my own experience, the intensity with which I feel emotions is strongly correlated to how action-guiding it is, and I think as a child I felt emotions more intensely than now, which also fits the hypothesis that more ability to think abstractly reduces the intensity of emotions.

Good point, edited a link to the Google Doc into the post.

J Bostock*3817

From Rethink Priorities:

  1. We used Monte Carlo simulations to estimate, for various sentience models and across eighteen organisms, the distribution of plausible probabilities of sentience.
  2. We used a similar simulation procedure to estimate the distribution of welfare ranges for eleven of these eighteen organisms, taking into account uncertainty in model choice, the presence of proxies relevant to welfare capacity, and the organisms’ probabilities of sentience (equating this probability with the probability of moral patienthood)

Now with the disclaimer that I d... (read more)

1CB
Your disagreement, from what I understand, seems mostly to stem from the fact that shrimp have fewer neurons than humans. Did you check RP's piece on that topic, "Why Neuron Counts Shouldn't Be Used as Proxies for Moral Weight?" https://forum.effectivealtruism.org/posts/Mfq7KxQRvkeLnJvoB/why-neuron-counts-shouldn-t-be-used-as-proxies-for-moral They say this: "In regards to intelligence, we can question both the extent to which more neurons are correlated with intelligence and whether more intelligence in fact predicts greater moral weight; Many ways of arguing that more neurons results in more valenced consciousness seem incompatible with our current understanding of how the brain is likely to work; and There is no straightforward empirical evidence or compelling conceptual arguments indicating that relative differences in neuron counts within or between species reliably predicts welfare relevant functional capacities. Overall, we suggest that neuron counts should not be used as a sole proxy for moral weight, but cannot be dismissed entirely"
5Jeremy Gillen
Can you link to where RP says that?
niplav219

Their epistemics led them to do a Monte Carlo simulation to determine if organisms are capable of suffering (and if so, how much), get a value of 5 shrimp = 1 human, and then not bat an eye at this number.

Neither a physicalist nor a functionalist theory of consciousness can reasonably justify a number like this. Shrimp have 5 orders of magnitude fewer neurons than humans, so whether suffering is the result of a physical process or an information processing one, this implies that shrimp neurons do 4 orders of magnitude more of this process per second

... (read more)
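
Spelling out the arithmetic behind this (round numbers implied by the comment, not precise counts):

```python
human_neurons  = 1e11    # order-of-magnitude figure for a human brain
shrimp_neurons = 1e6     # assumed here: 5 orders of magnitude fewer than a human
welfare_ratio  = 1 / 5   # implied by "5 shrimp = 1 human"

# Suffering per neuron (shrimp relative to human) implied by the two ratios above.
per_neuron = welfare_ratio / (shrimp_neurons / human_neurons)
print(per_neuron)  # 2e4, i.e. roughly 4 orders of magnitude more per neuron
```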

If we approximate an MLP layer with a bilinear layer, then the effect of residual stream features on the MLP output can be expressed as a second order polynomial over the feature coefficients $f_i$. This will contain, for each feature, an $f_i^2 v_i+ f_i w_i$ term, which is "baked into" the residual stream after the MLP acts. Just looking at the linear term, this could be the source of Anthropic's observations of features growing, shrinking, and rotating in their original crosscoder paper. https://transformer-circuits.pub/2024/crosscoders/index.html
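
A minimal sketch of the bilinear-approximation point (toy shapes and weights of my own, not Anthropic's setup): for a bilinear layer g(x) = W_out((W1 x) ⊙ (W2 x)), feeding in a single feature direction with coefficient f produces an output that scales as f², which is the quadratic term above. The linear f_i w_i part would come from cross-terms with biases and other active features, which this toy omits.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden = 32, 64
W1 = rng.normal(size=(d_hidden, d_model)) / np.sqrt(d_model)
W2 = rng.normal(size=(d_hidden, d_model)) / np.sqrt(d_model)
W_out = rng.normal(size=(d_model, d_hidden)) / np.sqrt(d_hidden)

def bilinear_mlp(x):
    # Bilinear stand-in for an MLP: elementwise product of two linear maps, then a readout.
    return W_out @ ((W1 @ x) * (W2 @ x))

v = rng.normal(size=d_model)
v /= np.linalg.norm(v)  # a single residual-stream feature direction

for f in [0.5, 1.0, 2.0]:
    out = bilinear_mlp(f * v)
    print(f, np.linalg.norm(out))  # norm scales as f^2: doubling f quadruples the output
```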

That might be true but I'm not sure it matters. For an AI to learn an abstraction it will have a finite amount of training time, context length, search space width (if we're doing parallel search like with o3) etc. and it's not clear how the abstraction height will scale with those.

Empirically, I think lots of people feel the experience of "hitting a wall" where they can learn abstraction level n-1 easily from class; abstraction level n takes significant study/help; abstraction level n+1 is not achievable for them within reasonable time. So it seems like the time requirement may scale quite rapidly with abstraction level?

J Bostock110

I second this, it could easily be things which we might describe as "amount of information that can be processed at once, including abstractions" which is some combination of residual stream width and context length.

Imagine an AI can do a task that takes 1 hour. To remain coherent over 2 hours, it could either use twice as much working memory, or compress it into a higher level of abstraction. Humans seem to struggle with abstraction in a fairly continuous way (some people get stuck at algebra; some cs students make it all the way to recursion then hit a w... (read more)

8Stephen Fowler
While each mind might have a maximum abstraction height, I am not convinced that the inability of people to deal with increasingly complex topics is direct evidence of this. Is it that this topic is impossible for their mind to comprehend, or is it that they've simply failed to learn it in the finite time period they were given?
J Bostock611

Only partially relevant, but it's exciting to hear a new John/David paper is forthcoming!

J Bostock113

Furthermore: normalizing your data to variance=1 will change your PCA line (if the X and Y variances are different) because the relative importance of X and Y distances will change!
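
A quick demonstration with toy data (my own example): the leading principal component swings from nearly-along-X to the diagonal once each coordinate is rescaled to variance 1.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(scale=10.0, size=1000)            # X has much larger variance than Y
y = 0.1 * x + rng.normal(scale=0.5, size=1000)
data = np.column_stack([x, y])

def first_pc(d):
    d = d - d.mean(axis=0)
    _, _, vt = np.linalg.svd(d, full_matrices=False)
    return vt[0]

print("raw PC1:         ", first_pc(data))                     # ~[0.995, 0.100]: hugs the X axis
print("standardized PC1:", first_pc(data / data.std(axis=0)))  # ~[0.707, 0.707]: the diagonal
```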

J Bostock114

Thanks for writing this up. As someone who was not aware of the eye thing I think it's a good illustration of the level that the Zizians are on, i.e. misunderstanding key important facts about the neurology that is central to their worldview.

My model of double-hemisphere stuff, DID, tulpas, and the like is somewhat null-hypothesis-ish. The strongest version is something like this:

At the upper levels of predictive coding, the brain keeps track of really abstract things about yourself. Think "ego" "self-conception" or "narrative about yourself". This is norm... (read more)

7ChristianKl
It's worth noting that the only evidence we have that this is how unihemispheric sleep gets created comes from Zizian.info, which is critical of Ziz. Slimepriestess claimed in the interview with Ken that the author just made up the exercise independently. When dealing with a complex phenomenon, the idea of "I'll just use the naive null hypothesis" generally does not give you a good understanding of the phenomenon. It's like the theories the Greeks had of how various things work that ignore a lot of the actual phenomena. I think you are wrong if you see self-conception as independent of memories. If you take Steve Andreas's model laid out in Transform Your Self, a self-concept like "I'm a kind person" is inherently built up of memories of remembering yourself as a kind person. With Dissociative Identity Disorder that gets caused by trauma, the traumatic memories might be too much to easily integrate into the existing self-concept, so there's a need for a new personality to house those memories.

This is a very interesting point. I have upvoted this post even though I disagree with it because I think the question of "Who will pay, and how much will they pay, to restrict others' access AI?" is important.

My instinct is that this won't happen, because there are too many AI companies for this deal to work on all of them, and some of these AI companies will have strong kinda-ideological commitments to not doing this. Also, my model of (e.g. OpenAI) is that they want to eat as much of the world's economy as possible, and this is better done by selling (e... (read more)

1purple fire
Hm, this violates my model of the world. Realistically, I think there are like 3-4 labs[1] that matter, OAI, DM, Anthropic, Meta. Even if that was true, they will be at the whim of investors who are almost all big tech companies. This is the explicit claim I was making with the WTP argument. I think this is firmly not true, and OpenAI will make more money by selling just to Oracle. What evidence causes you to disagree? 1. ^ American/Western labs.
J Bostock110

That's part of what I was trying to get at with "dramatic" but I agree now that it might be 80% photogenicity. I do expect that 3000 Americans killed by (a) humanoid robot(s) on camera would cause more outrage than 1 million Americans killed by a virus which we discovered six months later was AI-created in some way.

Previous ballpark numbers I've heard floated around are "100,000 deaths to shut it all down" but I expect the threshold will grow as more money is involved. Depends on how dramatic the deaths are though, 3000 deaths was enough to cause the US to invade two countries back in the 2000s. 100,000 deaths is thirty-three 9/11s.

GeneSmith144

I think the response to 9/11 was an outlier mostly caused by the "photogenic" nature of the disaster. COVID killed over a million Americans yet we basically forgot about it once it was gone. We haven't seen much serious investment in measures to prevent a new pandemic.

Is there a particular reason to not include sex hormones? Some theories suggest that testosterone tracks relative social status. We might expect that high social status -> less stress (of the cortisol type) + more metabolic activity. Since it's used by trans people we have a pretty good idea of what it does to you at high doses (makes you hungry, horny, and angry) but it's unclear whether it actually promotes low cortisol-stress and metabolic activity.

I'm mildly against this being immortalized as part of the 2023 review, though I think it serves excellently as a community announcement for Bay Area rats, which seems to be its original purpose.

I think it has the most long-term-relevant information (about AI and community building) back-loaded and the least relevant information (statistics and details about a no-longer-existent office space in the Bay Area) front-loaded. This is a very Bay Area centric post, which I don't think is ideal.

A better version of this post would be structured as a round-up of the main future-relevant takeaways, with specifics from the office space as examples.

I'm only referring to the reward constraint being satisfied for scenarios that are in the training distribution, since this maths is entirely applied to a decision taking place in training. Therefore I don't think distributional shift applies.

2abramdemski
Ah yep, that's a good clarification.

I haven't actually thought much about particular training algorithms yet. I think I'm working on a higher level of abstraction than that at the moment, since my maths doesn't depend on any specifics about V's behaviour. I do expect that in practice an already-scheming V would be able to escape some finite-time reasonable-beta-difference situations like this, with partial success.

I'm also imagining that during training, V is made up of different circuits which might be reinforced or weakened.

My view is that, if V is shaped by a training process like this, t... (read more)

I was arguing that if your assumptions are obeyed only approximately, then the argument breaks down quickly.

All arguments break down a bit when introduced to the real world. Is there a particular reason why the approximation error to argument breakdown ratio should be particularly high in this case? 

Example, if we introduce some error to the beta-coherence assumption:

Assume beta_t = 1, beta_s = 0.5, r_1 = 1, r_2 = 0.

V(s_0) = e/(1+e) +/- delta = 0.731 +/- delta

Actual expected value = 0.622

Even if |delta| = 0.1 the system cannot be coherent over training in this case. This seems to be relatively robust to me.
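
A quick check of these numbers (assuming, as in the example above, that beta-coherence means V(s_0) equals the softmax-weighted average of the successor values, here just the terminal rewards r_1 = 1 and r_2 = 0):

```python
import numpy as np

def softmax_value(values, beta):
    # Value of a state whose actions are sampled with softmax weights exp(beta * value).
    values = np.asarray(values, float)
    p = np.exp(beta * values)
    p /= p.sum()
    return float(p @ values)

r = [1.0, 0.0]
print(softmax_value(r, beta=1.0))  # 0.731... = e/(1+e): what a beta_t-coherent V must assign
print(softmax_value(r, beta=0.5))  # 0.622...: the actual expectation when sampling at beta_s
```

The gap between the two is about 0.109, which is where the "even |delta| = 0.1 is not enough" comparison comes from.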

2abramdemski
Yeah, of course the notion of "approximation error" matters a lot here. I was imagining trying to implement the hypothetical pre-training. An obvious implementation strategy is to do regular RL training with βt until V is approximately coherent. That's the pre-training. And then you switch to training with βs.[1] The natural notion of approximation error in this case is the one you can easily measure, IE, the expected distance from coherence of states sampled with temperature βt. This reflects the fact that it'll be hard for this pre-training step to make states individually obey the coherence constraint when those states are very rarely or never sampled at temperature βt, but easy for frequently-sampled states. My argument, then, is that it is easy in principle for V to distinguish whether it is in pre-training or regular training when it encounters states which have a significant difference in probability when we're sampling according to βt or βs. A scheming V can utilize this to self-preserve. This violates the assumption of βt-coherence, but in a very plausible-seeming way. 1. ^ My earlier comment about this mistakenly used β1 and β2 in place of βt and βs, which may have been confusing. I'll go fix that to be consistent with your notation.

This generalizes to an argument that the method is very sensitive to imperfections in the beta-coherence. If the V starts out merely approximately beta-coherent, this leaves room for V to detect when a state is off-distribution (IE has very low probability under the previous beta), and behave differently in this new case (reward-hacking the new beta).

I agree that there are some exceedingly pathological Vs which could survive a a process which obeys my assumptions with high probability, but I don't think that's relevant because I still think a process obeyi... (read more)

4abramdemski
To be clear, that's not the argument I was trying to make; I was arguing that if your assumptions are obeyed only approximately, then the argument breaks down quickly.

Trained with what procedure, exactly?

Fair point. I was going to add that I don't really view this as a "proposal" but more of an observation. We will have to imagine a procedure which converges on correctness and beta-coherence. I was abstracting this away because I don't expect something like this to be too hard to achieve.

Since I've evidently done a bad job of explaining myself, I'll backtrack and try again:

There's a doom argument which I'll summarize as "if your training process generates coherent agents which succeed at a task, one solution is that ... (read more)

4abramdemski
With you so far. OK. I think I follow the line of reasoning now. The proposal relies on some very special pre-training process. I think I buy the conclusion given the assumptions. It seems clear in the extreme high-temperature case: V cannot reward-hack for max-temperature training, because it has no control over which actions are chosen. Training eventually forces V to converge to the desired target instead. This sounds similar to the two-stage training process I mentioned earlier (as one of several guesses at what you meant), where the beta is shifted between the two training phases. The "hypothetical process to generate agents which are coherent under one beta" would be the first training phase, and then the "apply a different beta during training" would be the second training phase. Of course, the two-stage-training attempt to implement this has the problem that it only achieves approximate coherence. As a result, the malign hypothesis can use the reward-hacking strategy "appear to be βt-coherent & aligned in the first phase of training; appear to be βs-coherent and aligned in the second phase of training; then, do some other thing when deployed". This relies on detecting the distributional shift between the two training phases (it can look for evidence of beta by examining the history), and also detecting distributional shift to deployment. This generalizes to an argument that the method is very sensitive to imperfections in the beta-coherence. If the V starts out merely approximately beta-coherent, this leaves room for V to detect when a state is off-distribution (IE has very low probability under the previous beta), and behave differently in this new case (reward-hacking the new beta). So, while it remains technically true that V cannot reward-hack in the limit of arbitrarily long max-temperature training, it is not clear how well this holds up in practice. V can detect that it is in a high-temperature configuration by observing the state. The training n

The argument could also be phrased as "If an AI is trained to be coherent wrt a high beta, it cannot also be coherent wrt a low beta. Therefore an AI trained to a high beta cannot act coherently over multiple independent RL episodes if sampled with a low beta."

The contradiction that I (attempt to) show only arises because we assume that the value function is totally agnostic of the state actually reached during training, other than due to its effects on a later deployed AI.

Therefore a value function trained with such a procedure must consider the state rea... (read more)

2abramdemski
Is the idea to train with high beta and then use lower beta post-training?

* If so, how does this relate to reward hacking and value preservation? IE, where do V1 and V2 come from, if they aren't the result of a further training step? If high beta is used during training (to achieve beta-coherence) and then low beta is used in production, then the choice between V1 and V2 must be made in production (since it is made with low beta), but then it seems like V1=V2.
* If not, then when does the proposal suggest to use high beta vs low beta? If low beta is used during training, then how is it that V is coherent with respect to high beta instead?

Another concern I have is that if both beta values are within a range that can yield useful capabilities, it seems like the difference cannot be too great. IIUC, the planning failure postulated can only manifest if the reward-hacking relies heavily on a long string of near-optimal actions, which becomes improbable under increased temperature. Any capabilities which similarly rely on long strings of near-optimal actions will similarly be hurt. (However, this concern is secondary to my main confusion.)

Trained with what procedure, exactly?

(These parts made sense to me modulo my other questions/concerns/confusions.)

I think you're right, correctness and beta-coherence can be rolled up into one specific property. I think I wrote down correctness as a constraint first, then tried to add coherence, but the specific property is that:

For non-terminal s, this can be written as:

If s is terminal then [...] we just have .

Which captures both. I will edit the post to clarify this when I get time.

2abramdemski
If the probability of eventually encountering a terminal state is 1, then beta-coherence alone is inconsistent with deceptive misalignment, right? That's because we can determine the value of V exactly from the reward function and the oracle, via backwards-induction. (I haven't revisited RL convergence theorems in a while, I suspect I am not stating this quite right.) I mean, it is still consistent in the case where r is indifferent to the states encountered during training but wants some things in deployment (IE, r is inherently consistent with the provided definition of "deceptively misaligned"). However, it would be inconsistent for r that are not like that. In other words: you cannot have inner-alignment problems if the outer objective is perfectly imposed. You can only have inner-alignment problems if there are important cases which your training procedure wasn't able to check (eg, due to distributional shift, or scarcity of data). Perfect beta-coherence combined with a perfect oracle O rules this out.
2Joseph Miller
Is that rolling up two things into one, or is that just beta-coherence?

I somehow missed that they had a discord! I couldn't find anything on mRNA on their front-facing website, and since it hasn't been updated in a while I assumed they were relatively inactive. Thanks! 

Thinking back to the various rationalist attempts to make a vaccine (https://www.lesswrong.com/posts/niQ3heWwF6SydhS7R/making-vaccine), for bird-flu-related reasons. Since then, we've seen mRNA vaccines arise as a new vaccination method. mRNA vaccines have been used intranasally for COVID with success in hamsters. If one can order mRNA for a flu protein, it would only take mixing that with some sort of delivery mechanism (such as Lipofectamine, which is commercially available) and snorting it to get what could actually be a pretty good vaccine. Has RaDVac or similar looked at this?

  1. Thanks for catching the typo.
  2. Epistemic status has been updated to clarify that this is satirical in nature.
2noggin-scratcher
Oh I was very on board with the sarcasm. Although as a graduate of one of them, I obviously can't believe you're rating the other one so highly.
J Bostock10-3

I don't think it was unforced

You're right, "unforced" was too strong a word, especially given that I immediately followed it with caveats gesturing to potential reasonable justifications.

Yes, I think the bigger issue is the lack of top-down coordination on the comms pipeline. This paper does a fine job of being part of a research -> research loop. Where it fails is in being good for comms. Starting with a "good" model and trying (and failing) to make it "evil" means that anyone using the paper for comms has to introduce a layer of abstraction into their... (read more)

J Bostock*5627

Edited for clarity based on some feedback, without changing the core points

To start with an extremely specific example that I nonetheless think might be a microcosm of a bigger issue: the "Alignment Faking in Large Language Models" paper contained a very large unforced error: namely that you started with Helpful-Harmless-Claude and tried to train out the harmlessness, rather than starting with Helpful-Claude and training in harmlessness. This made the optics of the paper much more confusing than it needed to be, leading to lots of people calling it "good news". ... (read more)

contained a very large unforced error

It's possible this was a mistake and we should have more aggressively tried to explore versions of the setting where the AI starts off more "evil", but I don't think it was unforced. We thought about this a bunch and considered if there were worthwhile things here.

Edit: regardless, I don't think this example is plausibly a microcosm of a bigger issue as this choice was mostly made by individual researchers without much top down influence. (Unless your claim is that there should have been more top down influence.)

9davekasten
FWIW re: the Dario 2025 comment, Anthropic very recently posted a few job openings for recruiters focused on policy and comms specifically, which I assume is a leading indicator for hiring. One plausible rationale there is that someone on the executive team smashed the "we need more people working on this, make it happen" button.
J Bostock314

Since I'm actually in that picture (I am the one with the hammer) I feel an urge to respond to this post. The following is not the entire endorsed and edited worldview/theory of change of Pause AI, it's my own views. It may also not be as well thought-out as it could be.

Why do you think "activists have an aura of evil about them"? In the UK, where I'm based, we usually see a large march/protest/demonstration every week. Most of the time, the people who agree with the activists are vaguely positive and the people who disagree with the activists are vaguely n... (read more)

4Eneasz
My answer is going to be unsatisfying - entirely vibes. While there are still significant sections of the populace that have left-over affection for anything that looks like the Civil Rights movement due to how valorized that movement is and how much change it effected, this is seriously waning. The non-effectiveness of movements that just copy the aesthetics is slowly making them look more like cargo cults that copy the form but without an understanding of the substance that made them successful. As more people dismiss protestors as performance without substance, protests start getting more awful to get anyone's attention. Destroying social value and public goods for a cause no one else cares about grows increasingly irksome. When major lawlessness threatens people and sets fire to city blocks in the name of activism, the good will drains away pretty rapidly. Now the cargo cults are just destroying stuff without any path to how that's supposed to make things better. It's an ongoing change. We're only seeing the start of it. But IMO it's pretty undeniable that a decent percentage of the population thinks of activists as default harmful, and a preference cascade is just over the horizon.
J Bostock131

This, more than the original paper, or the recent Anthropic paper, is the most convincingly-worrying example of AI scheming/deception I've seen. This will be my new go-to example in most discussions. This comes from first considering a model property which is both deeply and shallowly worrying, then robustly eliciting it, and finally ruling out alternative hypotheses.

J Bostock15-26

I think it's very unlikely that a mirror bacterium would be a threat. <1% chance of a mirror-clone being a meaningfully more serious threat to humans as a pathogen than the base bacterium. The adaptive immune system just isn't chirally dependent. Antibodies are selected as needed from a huge library, and you can get antibodies to loads of unnatural things (PEG, chlorinated benzenes, etc.). They trigger attack mechanisms like MAC which attacks membranes in a similarly independent way.

In fact, mirror amino acids are already somewhat common in nature! Bacteria... (read more)

3P. João
It seems you’ve considered a lot of interesting variables, which would likely lower the overall probability.
8dr_s
The antibodies themselves not being chirally dependent doesn't mean that other fundamental links in the chain that leads to antibodies being deployed at all aren't chirality-dependent. Mostly I imagine the risk is that we have a lot of systems optimized for dealing with life of a certain chirality. They may be able to cope with the opposite chirality, but less so. COVID alone showed what happens when something far less alien, but just barely out of distribution for our current immune defenses, arrives: literally everyone in the world gets it in a matter of months, and a non-insignificant percentage dies even if the pathogen itself is no more complex or virulent than others we deal with on the daily. And COVID was easy mode. We have examples of far more apocalyptic outcomes from immune-naive populations getting in contact with new pathogens. Here we're not even talking about somehow innocuous entities. E. coli can and will kill you if it gets in the wrong place while your defenses are down, no mirroring necessary. Staph. aureus is everywhere already and will eat your flesh while you still live if given the chance. The only reason why we coexist with these threats is that we are in an armed truce: they can stay within their turf, but as soon as they try and go where they don't belong, they get terminated with maximum prejudice. Immuno-compromised people have to fear them a lot more. Imagining a version of them that is both antibiotic-resistant (because I bet that's also a consequence of chirality) and able to evade at least the first few layers of immune defenses, until somehow the system scrambles to compensate and manages to churn out a counter-measure, is terrifying enough. That the immune system may eventually cope with them doesn't mean it wouldn't be an apocalyptic pandemic (and worse, one that affects man and animal alike, all at once).

Yes, antibodies could adapt to mirror pathogens. The concern is that the system which generates antibodies wouldn't be strongly triggered. The Science article says: “For example, experiments show that mirror proteins resist cleavage into peptides for antigen presentation and do not reliably trigger important adaptive immune responses such as the production of antibodies (11, 12).”

8DirectedEvolution
Acquired immune systems (antibodies, T cells) are restricted to jawed vertebrates.
5cdt
I think this is the crux of the different feelings around this paper. There are a lot of unknowns here. The paper does a good job of acknowledging this and (imo) it justifies a precautionary approach, but I think the breadth of uncertainty is difficult to communicate in e.g. policy briefs or newspaper articles.

I think the risk of infection to humans would be very low. The human body can generate antibodies to pretty much anything (including PEG, benzenes, which never appear in nature) by selecting protein sequences from a huge library of cells. This would activate the complement system which targets membranes and kills bacteria in a non-chiral way.

The risk to invertebrates and plants might be more significant, not sure about the specifics of plant immune system.

J Bostock242

So Sonnet 3.6 can almost certainly speed up some quite obscure areas of biotech research. Over the past hour I've got it to:

  1. Estimate a rate, correct itself (although I did have to clock that its result was likely off by some OOMs, which turned out to be 7-8), request the right info, and then get a more reasonable answer.
  2. Come up with a better approach to a particular thing than I was able to, which I suspect has a meaningfully higher chance of working than what I was going to come up with.

Perhaps more importantly, it required almost no mental effort on my ... (read more)

1Qumeric
I think you might find this paper relevant/interesting: https://aidantr.github.io/files/AI_innovation.pdf TL;DR: Research on LLM productivity impacts in materials discovery. Main takeaways:

* Significant productivity improvement overall
* Mostly at the idea generation phase
* Top performers benefit much more (because they can evaluate the AI's ideas well)
* Mild decrease in job satisfaction (AI automates the most interesting parts, impact partly counterbalanced by improved productivity)

In practice, sadly, developing a true ELM is currently too expensive for us to pursue (but if you want to fund us to do that, lmk). So instead, in our internal research, we focus on finetuning over pretraining. Our goal is to be able to teach a model a set of facts/constraints/instructions and be able to predict how it will generalize from them, and ensure it doesn’t learn unwanted facts (such as learning human psychology from programmer comments, or general hallucinations).

 

This has reminded me to revisit some work I was doing a couple of months ago ... (read more)

Shrimp have ultra tiny brains, with less than 0.1% of human neurons.

Humans have 1e11 neurons; what's the source for the shrimp neuron count? The closest I can find is lobsters having 1e5 neurons, and crabs having 1e6 (all from Google AI overview), which is a factor of much more than 1,000.
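
For concreteness, using the lobster figure as a stand-in:

```python
human, lobster = 1e11, 1e5
print(lobster / human)  # 1e-06, i.e. 0.0001% of human neurons -- a factor of ~a million, not ~a thousand
```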

3Arturo Macias
This is the kind of criticism I kindly welcome. I used the cockroach data (forebrain) here as a Proxy: https://en.m.wikipedia.org/wiki/List_of_animals_by_number_of_neurons#:~:text=The human brain contains 86,neurons in the cerebral cortex.

I volunteer to play Minecraft with the LLM agents. I think this might be one eval where the human evaluators are easy to come by.

1Yonatan Cale
:)   If you want to try it meanwhile, check out https://github.com/MineDojo/Voyager

Ok: I'll operationalize this as the ratio of first choices for the first group (Stop/PauseAI) to first choices for projects in the third and fourth groups (mech interp, agent foundations), for the periods 12th-13th vs 15th-16th. I'll discount the final day since the final-day spike is probably confounding.

4Linda Linsefors
12th-13th
* 18 total applications
* 2 (11%) Stop/Pause AI
* 7 (39%) Mech-Interp and Agent Foundations

15th-16th
* 45 total applications
* 4 (9%) Stop/Pause AI
* 20 (44%) Mech-Interp and Agent Foundations

All applications
* 370 total
* 33 (12%) Stop/Pause AI
* 123 (46%) Mech-Interp and Agent Foundations

Looking at the above data, it is directionally correct for your hypothesis, but it doesn't look statistically significant to me. The numbers are pretty small, so it could be a fluke. So I decided to add some more data:

10th-11th
* 20 total applications
* 4 (20%) Stop/Pause AI
* 8 (40%) Mech-Interp and Agent Foundations

Looking at all of it, it looks like Stop/Pause AI applications are coming in at a stable rate, while Mech-Interp and Agent Foundations are going up a lot after the 14th.
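
A rough way to check the "could be a fluke" intuition on the 12th-13th vs 15th-16th counts (a sketch assuming scipy is available, and treating first choices as independent draws):

```python
from scipy.stats import fisher_exact

#        Stop/Pause AI   other first choices
table = [[2, 18 - 2],    # 12th-13th
         [4, 45 - 4]]    # 15th-16th
odds_ratio, p_value = fisher_exact(table)
print(odds_ratio, p_value)  # p-value well above 0.05: the difference in rates is not significant
```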

It might be the case that AISC was extra late-skewed because the MATS rejection letters went out on the 14th (guess how I know) so I think a lot of people got those and then rushed to finish their AISC applications (guess why I think this) before the 17th. This would predict that the ratio of technical:less-technical applications would increase in the final few days.

2Linda Linsefors
Sounds plausible.

> This would predict that the ratio of technical:less-technical applications would increase in the final few days.

If you want to operationalise this in terms of project first choice, I can check.

For a good few years you'd have a tiny baby limb, which would make it impossible to have a normal prosthetic. I also think most people just don't want a tiny baby limb attached to them. I don't think growing it in the lab for a decade is feasible for a variety of reasons. I also don't know how they planned to wire the nervous system in, or ensure the bone sockets attach properly, or connect the right blood vessels. The challenge is just immense, and it gets less and less worth it over time as trauma surgery and prosthetics improve.

The regrowing limb thing is a nonstarter due to the issue of time, if I understand correctly. Salamanders that can regrow limbs take roughly the same amount of time to regrow them as the limb takes to grow in the first place. So it would be 1-2 decades before the limb was of adult size. Secondly, it's not as simple as just smearing some stem cells onto an arm stump. Limbs form because of specific signalling molecules in specific gradients. I don't think these are present in an adult body once the limb is made. So you'd need a socket which produces those, which you'd have to build in the lab, attach to a blood supply to feed the limb, etc.

0Purplehermann
The first issue seems minor - even if true, a 40 year old man could have a new arm by 60

My model: suppose we have a DeepDreamer-style architecture, where (given a history of sensory inputs) the babbler module produces a distribution over actions, a world model predicts subsequent sensory inputs, and an evaluator predicts expected future X. If we run a tree-search over some weighted combination of the X, Y, and Z maximizers' predicted actions, then run each of the X, Y, and Z maximizers' evaluators, we'd get a reasonable approximation of a weighted maximizer.

This wouldn't be true if we gave negative weights to the maximizers, because while th... (read more)
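
A minimal sketch of the construction described above (interfaces and names are hypothetical, not any particular implementation): pool candidate actions from each component maximizer's babbler, roll them through a shared world model, and score them with the weighted sum of the component evaluators.

```python
from dataclasses import dataclass
from typing import Any, Callable, List

@dataclass
class Maximizer:
    babble: Callable[[Any], List[Any]]   # state -> candidate actions (heuristic proposals)
    evaluate: Callable[[Any], float]     # state -> predicted future value of its own objective

def weighted_step(state: Any,
                  maximizers: List[Maximizer],
                  weights: List[float],
                  world_model: Callable[[Any, Any], Any],
                  top_k: int = 3) -> List[Any]:
    # Pool the actions proposed by every component maximizer's babbler.
    candidates = [a for m in maximizers for a in m.babble(state)]

    # Score each candidate by the weighted sum of the component evaluators,
    # applied to the world model's predicted next state.
    def score(action):
        next_state = world_model(state, action)
        return sum(w * m.evaluate(next_state) for w, m in zip(weights, maximizers))

    return sorted(candidates, key=score, reverse=True)[:top_k]
```

Note that flipping the sign of a weight only flips the evaluator term; the babblers still only propose actions that looked promising for the original objectives, which is the asymmetry the surrounding comments point at.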

Seems like if you're working with neural networks there's not a simple map from an efficient (in terms of program size, working memory, and speed) optimizer which maximizes X to an equivalent optimizer which maximizes -X. If we consider that an efficient optimizer does something like tree search, then it would be easy to flip the sign of the node-evaluating "prune" module. But the "babble" module is likely to select promising actions based on a big bag of heuristics which aren't easily flipped. Moreover, flipping a heuristic which upweights a small subset ... (read more)

2JBlack
How do you construct a maximizer for 0.3X+0.6Y+0.1Z from three maximizers for X, Y, and Z? It certainly isn't true in general for black box optimizers, so presumably this is something specific to a certain class of neural networks.
5habryka
True if you don't count the training process as part of the optimizer (which is a choice that sometimes makes sense and sometimes doesn't). If you count the training process as part of the optimizer, then you can of course just flip your loss function or RL signal most of the time.