All of Bogdan Ionut Cirstea's Comments + Replies

Bogdan Ionut Cirstea1mo200

The first automatically produced, (human) peer-reviewed, (ICLR) workshop-accepted[/able] AI research paper: https://sakana.ai/ai-scientist-first-publication/

nostalgebraist1mo*2313

This is a very low-quality paper.

Basically, the paper does the following:

A 1-layer LSTM gets inputs of the form [operand 1][operator][operand 2], e.g. 1+2 or 3*5
It is trained (I think with a regression loss? but it's not clear^[1]) to predict the numerical result of the binary operation
The paper proposes an auxiliary loss that is supposed to improve "compositionality."
- As described in the paper, this loss is the average squared difference between successive LSTM hidden states
- But, in the actual code, what it actually is, instead, is the average squared differen

... (read more)
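For concreteness, here is a minimal sketch of the auxiliary loss in the form the comment attributes to the paper's text: the mean squared difference between successive hidden states. It is not the paper's code. The function names, the averaging convention (over all time steps and dimensions), and the `weight` default are assumptions made only for illustration.

```python
from typing import Sequence


def successive_state_penalty(hidden_states: Sequence[Sequence[float]]) -> float:
    """Mean squared difference between consecutive hidden-state vectors.

    Assumption: the mean is taken over every (time step, dimension) pair.
    """
    total = 0.0
    count = 0
    for prev, curr in zip(hidden_states, hidden_states[1:]):
        if len(prev) != len(curr):
            raise ValueError("hidden states must share one dimensionality")
        for a, b in zip(prev, curr):
            total += (b - a) ** 2
            count += 1
    # With fewer than two states (or empty vectors) there is nothing to penalize.
    return total / count if count else 0.0


def total_loss(task_loss: float,
               hidden_states: Sequence[Sequence[float]],
               weight: float = 0.1) -> float:
    """Task loss plus the weighted penalty; `weight` is a hypothetical hyperparameter."""
    return task_loss + weight * successive_state_penalty(hidden_states)
```

A penalty of this shape pushes consecutive hidden states toward each other, so it is a smoothness term on the hidden-state trajectory.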

faul_sname1mo100

I'm skeptical.

Did the Sakana team publish the code that their scientist agent used to write the compositional regularization paper? The post says

For our choice of workshop, we believe the ICBINB workshop is a highly relevant choice for the purpose of our experiment. As we wrote in the main text, we selected this workshop because of its broader scope, challenging researchers (and our AI Scientist) to tackle diverse research topics that address practical limitations of deep learning, unlike most workshops with a narrow focus on one topic.
This workshop focuse

... (read more)

Detecting Strategic Deception Using Linear Probes

Bogdan Ionut Cirstea2mo00

There have been numerous scandals within the EA community about how working for top AGI labs might be harmful. So, when are we going to have this conversation: contributing in any way to the current US admin getting (especially exclusive) access to AGI might be (very) harmful?

[cross-posted from X]

Bogdan Ionut Cirstea3mo40

I find the pessimistic interpretation of the results a bit odd given considerations like those in https://www.lesswrong.com/posts/i2nmBfCXnadeGmhzW/catching-ais-red-handed.

6Lucius Bushnaq3mo

I don't see how the linked post would suggest a more optimistic interpretation of the results? The baseline is just asking GPT-4o to look at the chain of thought and guess whether the other model is lying. If linear probes and SAEs struggle to beat that baseline, then they're really not delivering on the promise of interpretability for evals many hoped for: Making manifest information about the model's cognition that isn't visible in the chain of thought. EDIT: It is technically possible for the linear probes to pick up on hidden information and still do no better than blackbox due to other shortcomings, I guess.

Daniel Kokotajlo's Shortform

Bogdan Ionut Cirstea3moΩ240

I also think it's important to notice how much less scary / how much more probably-easy-to-mitigate (at least strictly when it comes to technical alignment) this story seems than the scenarios from 10 years ago or so, e.g. from Superintelligence / from before LLMs, when pure RL seemed like the dominant paradigm to get to AGI.

2Daniel Kokotajlo3mo

I don't think it's that much better actually. It might even be worse. See this comment:

Daniel Kokotajlo's Shortform

Bogdan Ionut Cirstea3mo20

I agree it's bad news w.r.t. getting maximal evidence about steganography and the like happening 'by default'. I think it's good news w.r.t. lab incentives, even for labs which don't speak too much about safety.

Daniel Kokotajlo's Shortform

Bogdan Ionut Cirstea3moΩ130

I pretty much agree with 1 and 2. I'm much more optimistic about 3-5 even 'by default' (e.g. R1's training being 'regularized' towards more interpretable CoT, despite DeepSeek not being too vocal about safety), but especially if labs deliberately try for maintaining the nice properties from 1-2 and of interpretable CoT.

4Bogdan Ionut Cirstea3mo

7Daniel Kokotajlo3mo

This is bad actually. They are mixing process-based and outcome-based feedback. I think the particular way they did it (penalizing CoT that switches between languages) isn't so bad, but it's still a shame because the point of faithful CoT is to see how the model really thinks 'naturally.' Training the CoT to look a certain way is like training on the test set, so to speak. It muddies the results. If they hadn't done that, then we could probably learn something interesting by analyzing the patterns in when it uses English vs. Chinese language concepts.

Dario Amodei: On DeepSeek and Export Controls

Bogdan Ionut Cirstea3mo93

If "smarter than almost all humans at almost all things" models appear in 2026-2027, China and several others will be able to ~immediately steal the first such models, by default.

Interpreted very charitably: but even in that case, they probably wouldn't have enough inference compute to compete.

Human takeover might be worse than AI takeover

Bogdan Ionut Cirstea3mo90

Quick take: this is probably interpreting them over-charitably, but I feel like the plausibility of arguments like the one in this post makes e/acc and e/acc-adjacent arguments sound a lot less crazy.

6Tom Davidson3mo

I think rushing full steam ahead with AI increases human takeover risk

9Mitchell_Porter3mo

Yes, this is nothing like e/acc arguments. e/acc don't argue in favour of AI takeover; they refuse to even think about AI takeover. e/acc was "we need more AI for everyone now, or else the EAs will trap us all in stagnant woke dystopia indefinitely". Now it's "American AI must win or China will trap us in totalitarian communist dystopia indefinitely".

Hopenope's Shortform

Bogdan Ionut Cirstea3mo41

To the best of my awareness, there isn't any demonstrated proper differential compute efficiency from latent reasoning to speak of yet. It could happen, it could also not happen. Even if it does happen, one could still decide to pay the associated safety tax of keeping the CoT.

More generally, the vibe of the comment above seems too defeatist to me; related: https://www.lesswrong.com/posts/HQyWGE2BummDCc2Cx/the-case-for-cot-unfaithfulness-is-overstated.

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders

Bogdan Ionut Cirstea3mo30

They also require relatively little compute (often around $1 for a training run), so AI agents could afford to test many ideas.

Ok, this seems surprisingly cheap. Can you say more about what such a $1 training run typically looks like (what the hyperparameters are)? I'd also be very interested in any analysis of how SAE (computational) training costs scale vs. base LLM pretraining costs.

I wouldn't be surprised if SAE improvements were a good early target for automated AI research, especially if the feedback loop is just "Come up with idea, modify existin

... (read more)

7Adam Karvonen3mo

A $1 training run would be training 6 SAEs across 6 sparsities at 16K width on Gemma-2-2B for 200M tokens. This includes generating the activations, and it would be cheaper if the activations are precomputed. In practice this seems like large enough scale to validate ideas such as the Matryoshka SAE or the BatchTopK SAE.

Before smart AI, there will be many mediocre or specialized AIs

Bogdan Ionut Cirstea3mo40

Some additional evidence: o3 used 5.7B tokens per task to achieve its ARC score of 87.5%; it also scored 75.7% on low compute mode using 33M tokens per task:

https://arcprize.org/blog/oai-o3-pub-breakthrough
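A quick back-of-the-envelope sketch of what those two data points imply, using only the figures quoted above. The "points per doubling" summary assumes a log-linear relationship between tokens and score, which two points cannot establish; it is just one convenient way to read the gap.

```python
from math import log2

# Figures as quoted in the comment above.
HIGH_TOKENS, HIGH_SCORE = 5.7e9, 87.5
LOW_TOKENS, LOW_SCORE = 33e6, 75.7


def token_ratio(high_tokens: float, low_tokens: float) -> float:
    """How many times more tokens the high-compute setting used."""
    return high_tokens / low_tokens


def points_per_doubling(high_tokens: float, high_score: float,
                        low_tokens: float, low_score: float) -> float:
    """Score gain per doubling of tokens, assuming a log-linear trend."""
    return (high_score - low_score) / log2(high_tokens / low_tokens)


ratio = token_ratio(HIGH_TOKENS, LOW_TOKENS)
gain = points_per_doubling(HIGH_TOKENS, HIGH_SCORE, LOW_TOKENS, LOW_SCORE)
print(f"{ratio:.0f}x more tokens, about {gain:.1f} points per doubling")
```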

5 ways to improve CoT faithfulness

Bogdan Ionut Cirstea3mo20

It might also be feasible to use multimodal CoT, like in Imagine while Reasoning in Space: Multimodal Visualization-of-Thought, and consistency checks between the CoTs in different modalities. Below are some quotes from a related chat with Claude about this idea.

multimodal CoT could enhance faithfulness through several mechanisms: cross-modal consistency checking, visual grounding of abstract concepts, increased bandwidth costs for deception, and enhanced interpretability. The key insight is that coordinating deception across multiple modalities would be s

... (read more)

Charlie Steiner's Shortform

Bogdan Ionut Cirstea3mo15-4

At the very least, evals for automated ML R&D should be a very decent proxy for when it might be feasible to automate very large chunks of prosaic AI safety R&D.

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders

Bogdan Ionut Cirstea3mo40

We find that these evaluation results are nuanced and there is no one ideal SAE configuration - instead, the best SAE varies depending on the specifics of the downstream task. Because of this, we cannot combine the results into a single number without obscuring tradeoffs. Instead, we provide a range of quantitative metrics so that researchers can measure the nuanced effects of experimental changes.

It might be interesting (perhaps in the not-very-near future) to study if automated scientists (maybe roughly in the shape of existing ones, like https://sakana.... (read more)

3Adam Karvonen3mo

SAEs are early enough that there's tons of low hanging fruit and ideas to try. They also require relatively little compute (often around $1 for a training run), so AI agents could afford to test many ideas. I wouldn't be surprised if SAE improvements were a good early target for automated AI research, especially if the feedback loop is just "Come up with idea, modify existing loss function, train, evaluate, get a quantitative result".

Building AI Research Fleets

Bogdan Ionut Cirstea3mo*102

I expect that, fortunately, the AI safety community will be able to mostly learn from what people automating AI capabilities research and research in other domains (more broadly) will be doing.

It would be nice to have some hands-on experience with automated safety research, too, though, and especially to already start putting in place the infrastructure necessary to deploy automated safety research at scale. Unfortunately, AFAICT, right now this seems mostly bottlenecked on something like scaling up grantmaking and funding capacity, and there doesn't... (read more)

2jacquesthibs3mo

Agreed, but I will find a way.

How will we update about scheming?

Bogdan Ionut Cirstea4mo50

From https://x.com/__nmca__/status/1870170101091008860:

o1 was the first large reasoning model — as we outlined in the original “Learning to Reason” blog, it’s “just” an LLM trained with RL. o3 is powered by further scaling up RL beyond o1

@ryan_greenblatt Shouldn't this be interpreted as a very big update vs. the neuralese-in-o3 hypothesis?

7ryan_greenblatt4mo

5anaguma4mo

An LLM trained with a sufficient amount of RL maybe could learn to compress its thoughts into more efficient representations than English text, which seems consistent with the statement. I'm not sure if this is possible in practice; I've asked here if anyone knows of public examples.

What o3 Becomes by 2028

Bogdan Ionut Cirstea4mo70

Do you have thoughts on the apparent recent slowdown/disappointing results in scaling up pretraining? These might suggest very diminishing returns in scaling up pretraining significantly before 6e27 FLOP.

Vladimir_Nesov4mo100

They've probably scaled up 2x-4x compared to the previous scale of about 8e25 FLOPs, it's not that far (from 30K H100 to 100K H100). One point as I mentioned in the post is inability to reduce minibatch size, which might make this scaling step even less impactful than it should be judging from compute alone, though that doesn't apply to Google.

In any case this doesn't matter yet, since the 1 GW training systems are already being built (in case of Nvidia GPUs with larger scale-up worlds of GB200 NVL72), the decision to proceed to the yet-unobserved next lev... (read more)

Daniel Tan's Shortform

Bogdan Ionut Cirstea4mo50

Bogdan Ionut Cirstea4mo150

Gemini 2.0 Flash Thinking is claimed to 'transparently show its thought process' (in contrast to o1, which only shows a summary): https://x.com/denny_zhou/status/1869815229078745152. This might be at least a bit helpful in terms of studying how faithful (e.g. vs. steganographic, etc.) the Chains of Thought are.

4eggsyntax4mo

Other recent models that show (at least purportedly) the full CoT:
* Deepseek R1-Lite (note that you have to turn 'DeepThink' on)
* Qwen QwQ-32B

4Seth Herd4mo

Huge if true! Faithful Chain of Thought may be a key factor in whether the promise of LLMs as ideal for alignment pays off, or not. I am increasingly concerned that OpenAI isn't showing us o1's CoT because it's using lots of jargon that's heading toward a private language. I hope it's merely that it didn't want to show its unaligned "thoughts", and to prevent competitors from training on its useful chains of thought.

Testing which LLM architectures can do hidden serial reasoning

Bogdan Ionut Cirstea4mo20

A fairer comparison would probably be to actually try hard at building the kind of scaffold which could use ~$10k in inference costs productively. I suspect the resulting agent would probably not do much better than with $100 of inference, but it seems hard to be confident. And it seems harder still to be confident about what will happen even in just 3 years' time, given that pretraining compute seems like it will probably grow about 10x/year and that there might be stronger pushes towards automated ML.

A related announcement, explicitly targeting... (read more)

Bogdan Ionut Cirstea4mo40

The toy task reminded me of the 'Hidden side objectives' subsection in section 'B.1.2 OBFUSCATED SCHEMING REASONING' of Towards evaluations-based safety cases for AI scheming.

Before smart AI, there will be many mediocre or specialized AIs

Bogdan Ionut Cirstea4mo20

For the SOTA on swebench-verified as of 16-12-2024: 'it was around $5k for a total run.. around 8M tokens for a single swebench-problem.'

Densing Law of LLMs

Thoughts on sharing information about language model capabilities

'That means, around three months, it is possible to achieve performance comparable to current state-of-the-art LLMs using a model with half the parameter size.'

If this trend continues, combined with (better/more extensible) inference scaling laws, it could make LM agents much more competitive on many AI R&D capabilities soon, at much longer horizon tasks.
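To make the extrapolation concrete, here is a minimal sketch of what "half the parameter size every three months or so" implies if it simply continued. The 3-month halving time is taken from the quote above; the function name and the 70B starting point are hypothetical, and nothing here says the trend will actually hold.

```python
def params_for_same_performance(current_params: float,
                                months: float,
                                halving_months: float = 3.0) -> float:
    """Parameters needed to match today's performance after `months`,
    assuming the required size halves every `halving_months` months."""
    if halving_months <= 0:
        raise ValueError("halving_months must be positive")
    return current_params * 0.5 ** (months / halving_months)


# Hypothetical starting point: a 70B-parameter model today.
for months in (3, 6, 12):
    size = params_for_same_performance(70e9, months)
    print(f"after {months:2d} months: {size / 1e9:.2f}B parameters")
```

Under these assumptions a year of the trend is four halvings, i.e. a 16x reduction in the parameter count needed for the same performance.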

E.g. - figure 11 from RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts:

Also related: Before smart AI, there will be many mediocre or specialized AIs.... (read more)

Bogdan Ionut Cirstea5mo*20

The kind of instrumental reasoning required for alignment faking seems relevant, including through n-hop latent reasoning; see e.g. section 'B.1.3 HIDDEN SCHEMING REASONING' from Towards evaluations-based safety cases for AI scheming. I wouldn't be too surprised if models could currently bypass this through shortcuts, but a mix of careful data filtering + unlearning of memorized facts about deceptive learning, as suggested in https://www.lesswrong.com/posts/9AbYkAy8s9LvB7dT5/the-case-for-unlearning-that-removes-information-from-llm#Information_you_should_p... (read more)

Thoughts on sharing information about language model capabilities

Bogdan Ionut Cirstea5mo2-2

I don't dispute that transformers can memorize shortcuts. I do dispute their ability to perform latent (opaque) multi-hop reasoning robustly. And I think this should be (very) non-controversial; e.g. Mor Geva has many papers on this topic.

4habryka5mo

What is plausibly a valid definition of multi-hop reasoning that we care about and that excludes getting mathematical proofs right and answering complicated never-before-seen physics questions and doing the kind of thing that a smaller model needed to do a CoT for?

Thoughts on sharing information about language model capabilities

Bogdan Ionut Cirstea5mo2-3

I'm pointing out that transformers seem really bad at internal multi-hop reasoning; currently they can't even do 2-hop robustly, 3-hop robustly seems kind of out of the question right now, and scaling doesn't seem to help much either (see e.g. Figures 2 and 3 in Do Large Language Models Perform Latent Multi-Hop Reasoning without Exploiting Shortcuts? and also how much more robust and scalable CoT reasoning is). So 'chain-of-thought but internalized to the model will take over' seems very unlikely with transformers, and much more so if basic mitigations lik... (read more)

3habryka5mo

Transformers are obviously capable of doing complicated internal chains of reasoning. Just try giving them a difficult problem and force them to start their answer in the very next token. You will see no interpretable or visible traces of their reasoning, but they will still get it right for almost all questions. Visible CoT is only necessary for the frontier of difficulty. The rest is easily internalized.

Thoughts on sharing information about language model capabilities

Bogdan Ionut Cirstea5mo2-4

I do indeed predict that we will see chain-of-thought become less faithful as model capabilities increase, and that other ways of doing the same thing as chain-of-thought but internalized to the model will take over.

This prediction seems largely falsified as long as transformers remain the dominant architecture, and especially if we deliberately add optimization pressures towards externalized reasoning and against internal, latent reasoning; see e.g. Do Large Language Models Perform Latent Multi-Hop Reasoning without Exploiting Shortcuts? and LLMs Do Not Think Step-by-step In Implicit Reasoning.

2habryka5mo

I do not understand your comment at all. Why would it be falsified? Transformers are completely capable of steganography if you apply pressure towards it, which we will (and have done). In Deepseek we can already see weird things happening in the chain of thought. I will happily take bets that we will see a lot more of that.

Brief analysis of OP Technical AI Safety Funding

Thanks for this post!

This looks much worse than I thought it would, both in terms of funding underdeployment, and in terms of overfocusing on evals.

Bogdan Ionut Cirstea5mo00

Claude Sonnet-3.5 New, commenting on the limited scalability of RNNs, when prompted with 'comment on what this would imply for the scalability of RNNs, refering (parts of) the post' and fed https://epoch.ai/blog/data-movement-bottlenecks-scaling-past-1e28-flop (relevant to opaque reasoning, out-of-context reasoning, scheming):

'Based on the article's discussion of data movement bottlenecks, RNNs (Recurrent Neural Networks) would likely face even more severe scaling challenges than Transformers for several reasons:

Sequential Nature: The article mentions pipe

... (read more)

4Lucas Teixeira5mo

I'm curious how these claims relate to what's proposed by this paper. (note, I haven't read either in depth)

Bogdan Ionut Cirstea5mo70

If this generalizes, OpenAI's Orion, rumored to be trained on synthetic data produced by O1, might see significant gains not just in STEM domains, but more broadly - from O1 Replication Journey -- Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?:

'this study reveals how simple distillation from O1's API, combined with supervised fine-tuning, can achieve superior performance on complex mathematical reasoning tasks. Through extensive experiments, we show that a base model fine-tuned on simply tens of thousands of sampl... (read more)

Before smart AI, there will be many mediocre or specialized AIs

Bogdan Ionut Cirstea5mo156

QwQ-32B-Preview was released open-weights and seems comparable to o1-preview. Unless they're gaming the benchmarks, I find it both pretty impressive and quite shocking that a 32B model can achieve this level of performance. Seems like great news vs. opaque (e.g. in one-forward-pass) reasoning. Less good with respect to proliferation (there don't seem to be any [deep] algorithmic secrets), misuse and short timelines.

8Vladimir_Nesov5mo

From a proliferation perspective, it reduces overhang, makes it more likely that Llama 4 gets long reasoning trace post-training in-house rather than later, and so initial capability evaluations give more relevant results. But if Llama 4 is already training, there might not be enough time for the technique to mature, and Llamas have been quite conservative in their techniques so far.

1mrtreasure5mo

There have been comments from OAI staff that o1 is "GPT-2 level" so I wonder if it's a similar size?

The above numbers suggest that (as long as sample efficiency doesn’t significantly improve) the world will always have enough compute to produce at least 23 million token-equivalents per second from any model that the world can afford to train (end-to-end, chinchilla-style). Notably, these are many more token-equivalents per second than we currently have human-AI-researcher-seconds per second. (And the AIs would have the further advantage of having much faster serial speeds.)
So once an AI system trained end-to-end can produce similarly much value per token

Bogdan Ionut Cirstea5mo31

I'm very uncertain and feel somewhat out of depth on this. I do have quite some hope though from arguments like those in https://aiprospects.substack.com/p/paretotopian-goal-alignment.

(Also, what Thane Ruthenis commented below.)

Bogdan Ionut Cirstea5mo64

I think the general impression of people on LW is that multipolar scenarios and concerns over "which monkey finds the radioactive banana and drags it home" are in large part a driver of AI racing instead of being a potential impediment/solution to it. Individuals, companies, and nation-states justifiably believe that whichever one of them accesses potentially superhuman AGI first will have the capacity to flip the gameboard at-will, obtain power over the entire rest of the Earth, and destabilize the currently-existing system. Standard game theory explains

... (read more)

2Bogdan Ionut Cirstea5mo

(Also, what Thane Ruthenis commented below.)

I'm envisioning something like: scary powerful capabilities/demos/accidents leading to various/a coalition of other countries asking the US (and/or China) not to build any additional/larger data centers (and/or run any larger training runs), and, if they're scared enough, potentially even threatening various (escalatory) measures, including economic sanctions, blockading the supply of compute/prerequisites to compute, sabotage, direct military strikes on the data centers, etc.

I'm far from an expert on the topic, but I suspect it might not be trivial to hid... (read more)

4Orpheus165mo

Thanks for spelling it out. I agree that more people should think about these scenarios. I could see something like this triggering central international coordination (or conflict). (I still don't think this would trigger the USG to take different actions in the near-term, except perhaps "try to be more secret about AGI development" and maybe "commission someone to do some sort of study or analysis on how we would handle these kinds of dynamics & what sorts of international proposals would advance US interests while preventing major conflict." The second thing is a bit optimistic but maybe plausible.)

Why Don't We Just... Shoggoth+Face+Paraphraser?

Bogdan Ionut Cirstea5mo160

Hot take, though increasingly moving towards lukewarm: if you want to get a pause/international coordination on powerful AI (which would probably be net good, though likely it would strongly depend on implementation details), arguments about risks from destabilization/power dynamics and potential conflicts between various actors are probably both more legible and 'truer' than arguments about technical intent misalignment and loss of control (especially for not-wildly-superhuman AI).

4Orpheus165mo

What kinds of conflicts are you envisioning? If the argument is something along the lines of "maybe at some point other countries will demand that the US stop AI progress", then from the perspective of the USG, I think it's sensible to operate under the perspective of "OK so we need to advance AI progress as much as possible and try to hide some of it, and if at some future time other countries are threatening us we need to figure out how to respond." But I don't think it justifies anything like "we should pause or start initiating international agreements."

(Separately, whether or not it's "truer" depends a lot on one's models of AGI development. Most notably: (a) how likely is misalignment, (b) how slow will takeoff be, and will it be very obvious to other nations that super advanced AI is about to be developed, and (c) how will governments and bureaucracies react and will they be able to react quickly enough.)

(Also separately: I do think more people should be thinking about how these international dynamics might play out & if there's anything we can be doing to prepare for them. I just don't think they naturally lead to an "oh, so we should be internationally coordinating" mentality and instead lead to much more of a "we can do whatever we want unless/until other countries get mad at us & we should probably do things more secretly" mentality.)

8Thane Ruthenis5mo

Agreed. I think a type of "stop AGI research" argument that's under-deployed is that there's no process or actor in the world that society would trust with unilateral godlike power. At large, people don't trust their own governments, don't trust foreign governments, don't trust international organizations, and don't trust corporations or their CEOs. Therefore, preventing anyone from building ASI anywhere is the only thing we can all agree on. I expect this would be much more effective messaging with some demographics, compared to even very down-to-earth arguments about loss of control. For one, it doesn't need to dismiss the very legitimate fear that the AGI would be aligned to values that a given person would consider monstrous. (Unlike "stop thinking about it, we can't align it to any values!".) And it is, of course, true.

[anonymous]5mo1411

arguments about risks from destabilization/power dynamics and potential conflicts between various actors are probably both more legible and 'truer'

Say more?

I think the general impression of people on LW is that multipolar scenarios and concerns over "which monkey finds the radioactive banana and drags it home" are in large part a driver of AI racing instead of being a potential impediment/solution to it. Individuals, companies, and nation-states justifiably believe that whichever one of them accesses potentially superhuman AGI first will have the cap... (read more)

Bogdan Ionut Cirstea5mo4-2

Here's a somewhat wild idea to have a 'canary in a coalmine' when it comes to steganography and non-human (linguistic) representations: monitor for very sharp drops in BrainScores (linear correlations between LM activations and brain measurements, on the same inputs) - e.g. like those calculated in Scaling laws for language encoding models in fMRI. (Ideally using larger, more diverse, higher-resolution brain data.)
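A rough sketch of the monitoring half of this idea, assuming a BrainScore has already been computed per checkpoint. `pearson` stands in for the correlation step only (real BrainScores typically fit an encoding model from activations to brain responses first), and the 20% relative-drop threshold is an arbitrary placeholder, not a calibrated value.

```python
from math import sqrt
from typing import List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between predicted and measured responses."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def sharp_drops(scores: Sequence[float],
                max_relative_drop: float = 0.2) -> List[int]:
    """Indices of checkpoints whose score fell by more than
    `max_relative_drop` relative to the previous checkpoint."""
    flagged = []
    for i in range(1, len(scores)):
        prev = scores[i - 1]
        if prev > 0 and (prev - scores[i]) / prev > max_relative_drop:
            flagged.append(i)
    return flagged
```

For example, a hypothetical score series `[0.40, 0.41, 0.39, 0.20, 0.21]` would flag only index 3, the checkpoint where the score roughly halves.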

Thoughts on “AI is easy to control” by Pope & Belrose

Bogdan Ionut Cirstea5mo41

Mostly the same, perhaps a minor positive update on the technical side (basically, from systems getting somewhat stronger - so e.g. closer to automating AI safety research - while still not showing very dangerous capabilities, like ASL-3, prerequisites to scheming, etc.). My views are even more uncertain / unstable on the governance side though, which probably makes my overall p(doom) (including e.g. stable totalitarianism, s-risks, etc.) more like 20% than 5% (I was probably mostly intuitively thinking of extinction risk only when giving the 5% figure a year ago; overall my median probably hasn't changed much, but I have more variance, coming from the governance side).

1Dakara5mo

If it's not a big ask, I'd really like to know your views on more of a control-by-power-hungry-humans side of AI risk. For example, the first company to create intent-aligned AGI would be wielding incredible power over the rest of us. I don't think I could trust any of the current leading AI labs to use that power fairly. I don't think this lab would voluntarily decide to give up control over it either (intuitively, it would take quite something for anyone to give up such a source of power). Is there anything that can be done to prevent such a scenario?

Why Don't We Just... Shoggoth+Face+Paraphraser?

Proposal part 3: Use Paraphraser: One more complication. The outputs of the Shoggoth? Paraphrase them. For example, after each line of reasoning CoT is complete, pause the Shoggoth and have a different specialized paraphraser model generate 10 syntactically distinct but semantically equivalent ways of saying the same thing. Then pick one at random and replace the original line of reasoning with it. Then boot up the Shoggoth again and have it continue the CoT.

Any thoughts on how much of a (computational) alignment tax this would impose? Related, thoughts on... (read more)

4Nathan Helm-Burger5mo

I think the risk of the paraphraser scheming seems pretty low if it is a pretrained smaller LLM which has been task-fine-tuned on paraphrasing sentences while keeping semantic meaning approximately the same. If you create this ahead of time, and freeze it (use it in inference mode only) during the training of the shoggoth, then any scheming about the shoggoth's purposes would still need to come from the shoggoth.

Could the shoggoth learn jailbreaks to get the paraphraser to say more specific things? This seems hard, but possible. To prevent this, you can do a cheap preventative action: take your small base model (e.g. Haiku) and make 100 different fine-tuned versions of it using different datasets and maybe vary some other training parameters. When training the shoggoth, always pick one of the 100 paraphrasers at random for each sentence. That seems like a hard ask for the shoggoth to jailbreak! Especially since it wouldn't know, when generating the sentence of its next CoT step, which of the 100 paraphrasers was going to be chosen to paraphrase the sentence.

This adds negligible cost, since the fine-tuning of the smaller model is cheap. You can use something like S-LoRA to swap between the different fine-tuned Haiku versions, so you don't have to load them all into GPU memory at once in order to do the inference.
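A minimal sketch of the control flow being discussed: each CoT line goes to a paraphraser drawn at random from a frozen pool, and one of its candidate rewrites (again chosen at random) replaces the original before the shoggoth continues. The `Paraphraser` and `generate_next` callables are hypothetical stand-ins for model calls; none of this is an existing API.

```python
import random
from typing import Callable, List, Optional, Sequence

# (sentence, n) -> n candidate rewrites; a stand-in for a frozen fine-tuned model.
Paraphraser = Callable[[str, int], List[str]]


def paraphrase_step(sentence: str,
                    paraphrasers: Sequence[Paraphraser],
                    n_candidates: int = 10,
                    rng: Optional[random.Random] = None) -> str:
    """Pick a random paraphraser, then a random one of its candidates."""
    if rng is None:
        rng = random.Random()
    if not paraphrasers:
        raise ValueError("need at least one paraphraser")
    paraphraser = rng.choice(paraphrasers)
    candidates = paraphraser(sentence, n_candidates)
    if not candidates:
        raise ValueError("paraphraser returned no candidates")
    return rng.choice(candidates)


def paraphrased_cot(generate_next: Callable[[List[str]], Optional[str]],
                    paraphrasers: Sequence[Paraphraser],
                    max_steps: int,
                    rng: Optional[random.Random] = None) -> List[str]:
    """Run the shoggoth line by line; it only ever sees paraphrased history."""
    if rng is None:
        rng = random.Random()
    history: List[str] = []
    for _ in range(max_steps):
        line = generate_next(history)  # shoggoth proposes the next CoT line
        if line is None:               # shoggoth signals it is done
            break
        history.append(paraphrase_step(line, paraphrasers, rng=rng))
    return history
```

The point of the design is that the shoggoth's own wording never enters its context, and it cannot know in advance which paraphraser will rewrite a given line.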

4Daniel Kokotajlo5mo

If the shoggoth and paraphraser are both same-size models, then arguably you are adding a significant computational cost by having the paraphraser generate 10x paraphrases of each paragraph. But you can probably get away with a significantly smaller paraphraser model. Like maybe you use a 10x smaller paraphraser model that generates 10x paraphrases at first, but then once the system is established you reduce it to 3x paraphrases, meaning your overall increased computational burden is 30%. And then later you train a specialized paraphraser model that's 10x smaller still, and now the burden is only 3%. idk, just a guess.

Paraphraser would definitely have opaque cognition but hopefully it wouldn't be scheming, because it wouldn't have agency skills or be situationally aware. It would just be a fine-tune of a base model. And a dumb/small base model at that. Heck it could even just be a prompted base model.
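The percentages above follow from a simple product, sketched here under two simplifying assumptions that are not stated in the comment: generation cost scales linearly with model size, and each paraphrase is about as long as the original text.

```python
def paraphrase_overhead(size_ratio: float, n_paraphrases: int) -> float:
    """Extra generation compute as a fraction of the shoggoth's own.

    size_ratio: paraphraser size divided by shoggoth size (0.1 = 10x smaller).
    n_paraphrases: paraphrases generated per original passage.
    """
    if size_ratio <= 0 or n_paraphrases < 0:
        raise ValueError("size_ratio must be positive, n_paraphrases non-negative")
    return size_ratio * n_paraphrases


scenarios = [
    ("same size, 10 paraphrases", 1.0, 10),
    ("10x smaller, 10 paraphrases", 0.1, 10),
    ("10x smaller, 3 paraphrases", 0.1, 3),
    ("100x smaller, 3 paraphrases", 0.01, 3),
]
for label, ratio, n in scenarios:
    print(f"{label}: {paraphrase_overhead(ratio, n):.0%} extra compute")
```

The third and fourth scenarios reproduce the 30% and 3% figures in the comment.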

Why Don't We Just... Shoggoth+Face+Paraphraser?

'China hawk and influential Trump AI advisor Jacob Helberg asserted to Reuters that “China is racing towards AGI," but I couldn't find any evidence in the report to support that claim.' https://x.com/GarrisonLovely/status/1859022323799699474

AFAICT, there seems to be quite heavy overlap between the proposal and Daniel's motivation for it and safety case (sketch) #3 in https://alignment.anthropic.com/2024/safety-cases/.

https://x.com/hamandcheese/status/1858897287268725080

Bogdan Ionut Cirstea5mo320

'🚨 The annual report of the US-China Economic and Security Review Commission is now live. 🚨

Its top recommendation is for Congress and the DoD to fund a Manhattan Project-like program to race to AGI.

Buckle up...'

2Bogdan Ionut Cirstea5mo

4Bogdan Ionut Cirstea5mo

'The report doesn't go into specifics but the idea seems to be to build / commandeer the computing resources to scale to AGI, which could include compelling the private labs to contribute talent and techniques. DX rating is the highest priority DoD procurement standard. It lets DoD compel companies, set their own price, skip the line, and do basically anything else they need to acquire the good in question.' https://x.com/hamandcheese/status/1858902373969564047

Leon Lang5mo294

In the Reuters article they highlight Jacob Helberg: https://www.reuters.com/technology/artificial-intelligence/us-government-commission-pushes-manhattan-project-style-ai-initiative-2024-11-19/

He seems quite influential in this initiative and recently also wrote this post:

https://republic-journal.com/journal/11-elements-of-american-ai-supremacy/

Wikipedia has the following paragraph on Helberg:

“He grew up in a Jewish family in Europe. Helberg is openly gay. He married American investor Keith Rabois in a 2018 ceremony officiated by Sam Altman.”

Might ... (read more)

5Sodium5mo

This chapter on AI follows immediately after the year in review. I went and checked the previous few years' annual reports to see what the comparable chapters were about. They are:

2023: China's Efforts To Subvert Norms and Exploit Open Societies
2022: CCP Decision-Making and Xi Jinping's Centralization Of Authority
2021: U.S.-China Global Competition (Section 1: The Chinese Communist Party's Ambitions and Challenges at its Centennial)
2020: U.S.-China Global Competition (Section 1: A Global Contest For Power and Influence: China's View of Strategic Competition With the United States)

And this year it's Technology And Consumer Product Opportunities and Risks (Chapter 3: U.S.-China Competition in Emerging Technologies).

Reminds me of when Richard Ngo said something along the lines of "We're not going to be bottlenecked by politicians not caring about AI safety. As AI gets crazier and crazier everyone would want to do AI safety, and the question is guiding people to the right AI safety policies"

4Bogdan Ionut Cirstea5mo

(screenshot in post from PDF page 39 of https://www.uscc.gov/sites/default/files/2024-11/2024_Annual_Report_to_Congress.pdf)

Catching AIs red-handed

And the space of interventions will likely also include using/manipulating model internals, e.g. https://transluce.org/observability-interface, especially since (some kinds of) automated interpretability seem cheap and scalable, e.g. https://transluce.org/neuron-descriptions estimated a cost of < 5 cents / labeled neuron. LM agents have also previously been shown able to do interpretability experiments and propose hypotheses: https://multimodal-interpretability.csail.mit.edu/maia/, and this could likely be integrated with the above. The auto-interp expl

johnswentworth's Shortform

Any thoughts on potential connections with task arithmetic? (later edit: in addition to footnote 2)

Would the prediction also apply to inference scaling (laws) - and maybe more broadly various forms of scaling post-training, or only to pretraining scaling?

2johnswentworth5mo

Some of the underlying evidence, like e.g. Altman's public statements, is relevant to other forms of scaling. Some of the underlying evidence, like e.g. the data wall, is not. That cashes out to differing levels of confidence in different versions of the prediction.