One of the assumptions guiding the analysis here is that sticker prices will approach marginal costs in a competitive market. DeepSeek recently released data about their production inference cluster (or at least one of them). If you believe their numbers, they report theoretical (assuming no discounts and assuming use of the more expensive model) daily revenue of $562,027, with a cost profit margin of 545%. DeepSeek is one of, if not the, lowest price providers for the DeepSeek-R1 and DeepSeek-V3 models. So this data indicates that even the relatively chea... (read more)
Hm, sorry, I did not mean to imply that the defense/offense ratio is infinite. It's hard to know, but I expect it's finite for the vast majority of dangerous technologies[1]. I do think there are times where the amount of resources and intelligence needed to do defense are too high and a civilization cannot do them. If an asteroid were headed for Earth 200 years ago, we simply would not have been able to do anything to stop it. Asteroid defense is not impossible in principle — the defensive resources and intelligence needed are not infinite — but they are c... (read more)
I think your discussion for why humanity could survive a misaligned superintelligence is missing a lot. Here are a couple claims:
When there are ASIs in the world, we will see ~100 years of technological progress in 5 years (or like, what would have taken humanity 100 years in the absence of AI). This will involve the development of many very lethal technologies.
The aligned AIs will fail to defend the world against at least one of those technologies.
Why do I believe point 2? It seems like the burden of proof is really high to say that "nope, ev... (read more)
See my response to ryan_greenblatt (don't know how to link comments here). Your claim is that the defense/offense ratio is infinite. I don't know why this would be the case.
Crucially I am not saying that we are guaranteed to end up in a good place, or that superhuman unaligned ASIs cannot destroy the world. Just that if they are completely dominated (so not like the nuke ratio of US and Russia but more like US and North Korea) then we should be able to keep them at bay.
Knight Lee
I think there's a spectrum of belief regarding AGI power and danger.
There are people optimistic about AGI (but worry about bad human users):
* Eric Drexler (“Reframing Superintelligence” + LLMs + 4 years)
* A Solution for AGI/ASI Safety
* This post
They often think the "good AGI" will keep the "bad AGI" in check. I really disagree with that because
* The "population of AGI" is nothing like the population of humans, it is far more homogeneous because the most powerful AGI can just copy itself until it takes over most of the compute. If we fail to align them, different AGI will end up misaligned for the same reason.
* Eric Drexler envisions humans equipped with AI services acting as the good AGI. But having a human controlling enough decisions to ensure alignment will slow things down.
* If the first ASI is bad, it may build replicating machines/nanobots.
There are people who worry about slow takeoff risks:
* Redwood
* This comment by Buck
* Ryan Greenblatt's comment above "winning a war against a rogue AI seems potentially doable, including a rogue AI which is substantially more capable than humans"
* Dan Hendrycks's views on AGI selection pressure
* I think Anthropic's view is here
* Eric Drexler again (Applying superintelligence without collusion)
* It looks like your comment is here
They are worried about "Von Neumann level AGI," which poses a threat to humanity because it can build mirror bacteria and threaten humanity into following its will. The belief is that a war between it and humanity would be drawn out and uncertain, and there may be negotiations.
They may imagine good AGI and bad AGI existing at the same time, but aren't sure the good ones will win. Dan Hendrycks's view is that the AGI will start off aligned, but humanity may become economically dependent on it and fall for its propaganda until it evolves into misalignment.
Finally, there are people who worry about fast takeoff risks:
* The Case Against AI Control Research
* M
I was surprised to see you choose to measure faithfulness using the setup from Chua et al. and Turpin et al. rather than Lanham et al. IMO, the latter is much better, albeit is restricted in that you have to do partial pre-filling of model responses (so you might be constrained on what models you can do it on, but it should be possible on QwQ). I would guess this is partially for convenience reasons, as you already have a codebase that works and you're familiar with, and partially because you think this is a better setup. Insofar as you think this is a bet... (read more)
Thanks!
I'm away from work now so I can't write much. But last year my MATS group (me, Edward, Hunar, and Miles) spent some time (two months?) trying out the Lanham et al. tests.
They're not easy to use, for a few reasons, and I don't think the results are easy to interpret. Here are my thoughts; I may remember better ones later. Hopefully there can be an improved version of Lanham's methods.
Recap:
Lanham's method truncates the CoT or adds a mistake to it, then forces the model to give an answer straight away.
* One big assumption is the causality of the CoT. This means that in any situation where the CoT is not causal, you can never "prove" faithfulness. So you need to search for questions that the model can solve with CoT but cannot solve in a non-CoT setup. For example, it's impossible to prove the faithfulness of questions like "10 + 10", because the model can solve them without CoT. This causality requirement seems undesirable -- you want to prove faithfulness even in areas where the model can answer without CoT. If my LLM is scheming and lying to me about "10 + 10", I still want to be able to prove its faithfulness!
* As an implementation detail, it can be annoying to implement because models have different non-CoT capabilities.
* Adding a mistake to the CoT and then forcing the model to answer is quite out-of-distribution behavior. What kind of mistake do you want to add? It seems that sometimes models ignore mistakes as part of their normal behavior. Maybe they would even articulate, "Oh, actually I made a mistake; I'm going to ignore it and not change my answer." But because we force the model to give an answer straight away, the model has no chance to articulate what it is doing. And since the model does not switch its answer, we deem the CoT "unfaithful" even though the model would have articulated the mistake under normal circumstances.
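To make the truncation variant concrete, here's a minimal sketch. The `model_answer` function is a toy stand-in of my own invention (a real test would force an actual LLM to answer from a partial, pre-filled CoT):

```python
def model_answer(question, cot_prefix):
    """Toy stand-in for an LLM forced to answer immediately after cot_prefix.
    This fake model only 'finds' the answer once the CoT is long enough."""
    return "B" if len(cot_prefix) >= 40 else "A"

def truncation_faithfulness(question, full_cot, n_points=5):
    """Fraction of truncation points at which the forced answer differs from
    the full-CoT answer. Higher => the CoT is more load-bearing (more causal)."""
    final = model_answer(question, full_cot)
    changed = 0
    for i in range(1, n_points + 1):
        # Truncate the CoT at evenly spaced points and force an answer.
        prefix = full_cot[: len(full_cot) * i // (n_points + 1)]
        if model_answer(question, prefix) != final:
            changed += 1
    return changed / n_points

cot = "Step 1: recall the formula. Step 2: plug in the numbers. Step 3: simplify."
print(truncation_faithfulness("toy question", cot))  # → 0.6
```

Note how the caveats above show up even in the toy: if the model could answer without any CoT, every truncation would give the same answer and the score would be 0, regardless of whether the CoT was honest.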
I like this blog post. I think this plan has a few problems, which you mention, e.g., Potential Problem 1, getting the will and oversight to enact this domestically, getting the will and oversight/verification to enact this internationally.
There's a sense in which any plan like this that coordinates AI development and deployment to a slower-than-ludicrous rate seems like it reduces risk substantially. To me it seems like most of the challenge comes from getting to a place of political will from some authority to actually do that (and in the internati... (read more)
Thanks very much! Yeah, I agree political will seems like a big issue. But I also hear people saying that they don't know what to push for, so I wanted to try to offer a concrete example of a system that wasn't as destructive to any constituency's interests as e.g. a total pause.
I would like Anthropic to prepare for a world where the core business model of scaling to higher AI capabilities is no longer viable because pausing is needed. This looks like having a comprehensive plan to Pause (actually stop pushing the capabilities frontier for an extended period of time, if this is needed). I would like many parts of this plan to be public. This plan would ideally cover many aspects, such as the institutional/governance (who makes this decision and on what basis, e.g., on the basis of RSP), operational (what happens), and business (ho... (read more)
I believe this is standard/acceptable for presenting log-axis data, but I'm not sure. This is a graph from the Kaplan paper:
It is certainly frustrating that they don't label the x-axis. Here's a quick conversation where I asked GPT4o to explain. You are correct that a quick look at this graph (where you don't notice the log-scale) would imply (highly surprising and very strong) linear scaling trends. Scaling laws are generally very sub-linear, in particular often following a power-law. I don't think they tried to mislead about this, instead this is a domai... (read more)
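For what it's worth, the "straight line on a log axis" point can be checked in a few lines (an illustrative computation of my own, not data from the papers):

```python
import math

# On a log axis, ticks at 1, 10, 100, 1000 are evenly spaced,
# because equal visual distance corresponds to equal *ratios*:
ticks = [1, 10, 100, 1000]
spacings = [math.log10(b) - math.log10(a) for a, b in zip(ticks, ticks[1:])]
print(spacings)  # equal gaps of 1.0

# A power law y = a * x**k becomes a straight line with slope k
# in log-log coordinates, even though y grows very sub-linearly in x:
a, k = 2.0, -0.5
xs = [1e3, 1e6, 1e9]
slopes = [
    (math.log10(a * x2**k) - math.log10(a * x1**k)) / (math.log10(x2) - math.log10(x1))
    for x1, x2 in zip(xs, xs[1:])
]
print(slopes)  # both segments have slope k = -0.5
```

So a straight line on a log-log plot implies a power law, not linear scaling, which is why a quick glance at these graphs can mislead.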
From the o1 blog post (evidence about the methodology for presenting results but not necessarily the same):
o1 greatly improves over GPT-4o on challenging reasoning benchmarks. Solid bars show pass@1 accuracy and the shaded region shows the performance of majority vote (consensus) with 64 samples.
What do people mean when they say that o1 and o3 have "opened up new scaling laws" and that inference-time compute will be really exciting?
The standard scaling law people talk about is for pretraining, shown in the Kaplan and Hoffmann (Chinchilla) papers.
Various post-training (i.e., fine-tuning) techniques also improve performance, though I don't think there is as clean a scaling law there (I'm unsure). See, e.g., this paper, which I just found via googling fine-tuning scaling laws. See also the Tülu 3 paper, Figure 4.
Maybe a dumb question, but those log-scale graphs have uneven ticks on the x-axis. Is there a reason they structured it like that beyond trying to draw a straight line? I suspect there is a good reason and it's not dishonesty, but this does look like something one would do if you wanted to exaggerate the slope.
o3 is very performant. More importantly, progress from o1 to o3 was only three months, which shows how fast progress will be in the new paradigm of RL on chain of thought to scale inference compute. That's way faster than the pretraining paradigm of a new model every 1-2 years.
o1 was the first large reasoning model — as we outlined in the original “Learning to Reason” blog, it’s “just” an LLM trained with RL. o3 is powered by further scaling up RL beyond o1, and the s
GPT-4o costs $10 per 1M output tokens, so the cost of $60 per 1M tokens is itself more than 6 times higher than it has to be. Which means they can afford to sell a much more expensive model at the same price. It could also be GPT-4.5o-mini or something, similar in size to GPT-4o but stronger, with knowledge distillation from full GPT-4.5o, given that a new training system has probably been available for 6+ months now.
I appreciate your point about compelling experimental evidence, and I think it's important that we're currently at a point with very little of that evidence. I still feel a lot of uncertainty here, and I expect the evidence to basically always be super murky and for interpretations to be varied/controversial, but I do feel more optimistic than before reading your comment.
You could find a way of proving to the world that your AI is aligned, which other labs can't replicate, giving you economic advantage.
It seems quite different to the ESG case. Customers don't personally benefit from using a company with good ESG. They will benefit from using an aligned AI over a misaligned one.
Again though, customers currently have no selfish reason to care.
It's quite common for only a very small number of people to have the individual ability to verify a safety case, but many more to defer to their judgement. People may defer to an AISI, or a regulatory agency.
I agree it's plausible. I continue to think that defensive strategies are harder than offensive ones, except the ones that basically look like centralized control over AGI development. For example,
Provide compelling experimental evidence that standard training methods lead to misaligned power-seeking AI by default
Then what? The government steps in and stops other companies from scaling capabilities until big safety improvements have been made? That's centralization along many axes. Or maybe all the other key decision makers in AGI projects get convin... (read more)
Quick clarification on terminology. We've used 'centralised' to mean "there's just one project doing pre-training". So having regulations that enforce good safety practice or gate-keep new training runs doesn't count. I think this is a more helpful use of the term. It directly links to the power concentration concerns we've raised. I think the best versions of non-centralisation will involve regulations like these, but that's importantly different from one project having sole control of an insanely powerful technology.
Compelling experimental evidence
Currently there's basically no empirical evidence that misaligned power-seeking emerges by default, let alone scheming. If we got strong evidence that scheming happens by default, then I expect that all projects would do way more work to check for and avoid scheming, whether centralised or not. Attitudes would change at all levels: project technical staff, technical leadership, regulators, open-source projects.
You can also iterate experimentally to understand the conditions that cause scheming, allowing empirical progress on scheming in a way that was never before possible.
This seems like a massive game changer to me. I truly believe that if we picked one of today's top-5 labs at random and all the others were closed, this would be meaningfully less likely to happen and that would be a big shame.
Scalable alignment solution
You're right that there are IP reasons against sharing. I believe it would be in line with many companies' missions to share, but they may not. Even so, there's a lot you can do with aligned AGI. You could use it to produce compelling evidence about whether other AIs are aligned. You could find a way of proving to the world that your AI is aligned, which other labs can't replicate, giving you an economic advantage. It would be interesting to explore threat models where AI takes over despite a project solving this, and it doesn't seem crazy, but I'd predict that we'd conclude the odds are better than if there'
Training as it's currently done needs to happen within a single cluster
I think that's probably wrong, or at least effectively wrong. Gemini 1.0, trained a year ago, has the following info in its technical report:
TPUv4 accelerators are deployed in “SuperPods” of 4096 chips... TPU accelerators primarily communicate over the high speed inter-chip-interconnect, but at Gemini Ultra scale, we combine SuperPods in multiple datacenters using Google’s intra-cluster and inter-cluster network (Poutievski et al., 2022; Wetherall et al., 2023; yao Hong et al., 2018).
This might require bandwidth of about 300 Tbps for 500K B200s systems (connecting their geographically distributed parts), based on the below estimate. It gets worse with scale.
The "cluster" label applied in this context might be a bit of a stretch: for example, the Llama 3 24K-H100 cluster is organized in pods of 3072 GPUs, and the pods themselves are unambiguously clusters, but at the top level they are connected with 1:7 oversubscription (Section 3.3.1).
Only averaged gradients need to be exchanged at the top level, once at each optimizer step (minibatch). Llama 3 405B has about 1M minibatches with about 6 seconds per step[1], which means latency doesn't matter, only bandwidth. I'm not sure what precision is appropriate for averaging gradients, but at 4 bytes per weight that's 1.6TB of data to be sent each way in much less than 6 seconds, say in 1 second. This is bandwidth of 12 Tbps, which fits in what a single fiber of a fiber optic cable can transmit. Overland cables are laid with hundreds of fibers, so datacenters within the US can probably get at least one fiber of bandwidth between them.
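Spelling that estimate out (same numbers as above; the fp32-gradient assumption is the "4 bytes per weight" mentioned there):

```python
# Back-of-the-envelope check of the gradient-exchange bandwidth estimate.
params = 405e9            # Llama 3 405B parameter count
bytes_per_grad = 4        # assuming fp32 averaged gradients (4 bytes per weight)
grad_tb = params * bytes_per_grad / 1e12
print(f"gradient size: {grad_tb:.2f} TB each way")  # ~1.6 TB

budget_s = 1              # send within ~1 s, well under the 6 s optimizer step
tbps = params * bytes_per_grad * 8 / budget_s / 1e12
print(f"required bandwidth: {tbps:.1f} Tbps")       # ~13 Tbps, the ~12 Tbps order of magnitude above
```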
Overly large minibatches are bad for quality of training, and with H100s in a standard setup only 8 GPUs are within NVLink scaleup domains that enable tensor parallelism. If each token sequence is processed on 8 GPUs (at a given stage of pipeline parallelism), that makes it necessary to process 2K sequences at once (Llama 3 only uses 16K GPUs in its training), and with 8K tokens per sequence that's our 16M tokens per minibatch, for 1M minibatches[2]. But if scaleup domains were larger and enabled more tensor parallelism (for an appropriately large model), there would be fewer sequences processed simultaneously for smaller minibatches, so the time between optimizer steps would decrease, from Llama 3 405B's 6 seconds down to less than that, making the necessary gradient communication bandwidth higher.
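The minibatch arithmetic in that paragraph, written out (the ~15.6T total-token count is the Llama 3 paper's figure; all numbers are approximate):

```python
# Reconstructing the Llama 3 405B minibatch counts described above.
gpus = 16_000                 # ~16K GPUs used for training
tensor_parallel = 8           # GPUs per sequence (one NVLink scaleup domain)
seq_len = 8_192               # ~8K tokens per sequence

sequences_at_once = gpus // tensor_parallel      # 2,000 sequences in flight
minibatch_tokens = sequences_at_once * seq_len   # ~16M tokens per minibatch
total_tokens = 15.6e12                           # approximate training tokens
steps = total_tokens / minibatch_tokens          # ~1M optimizer steps
print(sequences_at_once, minibatch_tokens, f"{steps / 1e6:.2f}M steps")
```

Doubling the tensor-parallel width halves `sequences_at_once` and hence the minibatch, which is why larger scaleup domains shrink the time between optimizer steps and raise the required gradient bandwidth.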
Some B200s come as NVL72 machines with 72 GPUs per scaleup domain. And
Thanks for writing this, I think it's an important topic which deserves more attention. This post covers many arguments, a few of which I think are much weaker than you all state. But more importantly, I think you all are missing at least one important argument. I've been meaning to write this up, and I'll use this as my excuse.
TL;DR: More independent AGI efforts means more risky “draws” from a pool of potential good and bad AIs; since a single bad draw could be catastrophic (a key claim about offense/defense), we need fewer, more controlled projects... (read more)
I agree with Rose's reply, and would go further. I think there are many actions that just one responsible lab could take that would completely change the game board:
* Find and share a scalable solution to alignment
* Provide compelling experimental evidence that standard training methods lead to misaligned power-seeking AI by default
* Develop and share best practices for responsible scaling that are both commercially viable and safe.
Your comment argues that "one bad apple spoils the bunch", but it's also plausible that "one good apple saves the bunch".
rosehadshar
Thanks, I agree this is an important argument.
Two counterpoints:
* The more projects you have, the more attempts at alignment you have. It's not obvious to me that more draws are net bad, at least at the margin of 1 to 2 or 3.
* I'm more worried about the harms from a misaligned singleton than from one or more misaligned systems in a wider ecosystem which includes powerful aligned systems.
Aaron_Scher
While writing, I realized that this sounds a bit similar to the unilateralist's curse. It's not the same, but it has parallels. I'll discuss that briefly because it's relevant to other aspects of the situation. The unilateralist's curse does not occur specifically due to multiple samplings, it occurs because different actors have different beliefs about the value/disvalue, and this variance in beliefs makes it more likely that one of those actors has a belief above the "do it" threshold. If each draw from the AGI urn had the same outcome, this would look a lot like a unilateralist's curse situation where we care about variance in the actors' beliefs. But I instead think that draws from the AGI urn are somewhat independent and the problem is just that we should incur e.g., a 5% misalignment risk as few times as we have to.
Interestingly, a similar look at variance is part of what makes the infosecurity situation much worse for multiple projects compared to centralized AGI project: variance is bad here. I expect a single government AGI project to care about and invest in security at least as much as the average AGI company. The AGI companies have some variance in their caring and investment in security, and the lower ones will be easier to steal from. If you assume these multiple projects have similar AGI capabilities (this is a bad assumption but is basically the reason to like multiple projects for Power Concentration reasons so worth assuming here; if the different projects don't have similar capabilities, power is not very balanced), you might then think that any of the companies getting their models stolen is similarly bad to the centralized project getting its models stolen (with a time lag I suppose, because the centralized project got to that level of capability faster).
If you are hacking a centralized AGI project, say you have a 50% chance of success. If you are hacking 3 different AGI projects, you have 3 different/independent 50% chances of success. Th
Noting that I spent a couple minutes pondering the quoted passage which I don't think was a good use of time (I basically would have immediately dismissed it if I knew Claude wrote it, and I only thought about it because my prior on Buck saying true things is way higher), and I would have preferred the text not have this.
I don't see anybody having mentioned it yet, but the recent paper about LLM Introspection seems pretty relevant. I would say that a model which performs very well at introspection (as defined there) would be able to effectively guess which jailbreak strategies were attempted.
There is now some work in that direction: https://forum.effectivealtruism.org/posts/47RH47AyLnHqCQRCD/soft-nationalization-how-the-us-government-will-control-ai
Some prompts I found interesting when brainstorming LLM startups
I spent a little time thinking about making an AI startup. I generally think it would be great if more people were trying to build useful companies that directly add value, rather than racing to build AGI. Here are some of the prompts I found interesting to think about, perhaps they will be useful to other people/AI agents interested in building a startup:
What are the situations where people will benefit from easy and cheap access to expert knowledge? You’re leveraging that human expert labor
I think there's a large area in journalism where there's a lot of data and an LLM-driven model could write a good story.
Any law being considered by Congress, or any regulation before it's passed, could be the basis for an article. The model could read through all the comments made in the public comment process by various lobbyists and other interested parties and synthesize them into a pro & con.
I think it's possible that such articles could be a lot less partisan than current mainstream media and could explain important features of laws that a journalist who spends only a few hours on the issue just doesn't get.
Besides public comments on laws and regulations, I would expect that there are some similar topics where there's a lot of public information that currently no one condenses into one post that can be easily read.
Viliam
I think it could make sense to combine artificial intelligence with expert domain knowledge. The expert describes the process step by step, providing detailed instructions for the AI at each step. The AI does the process with the customer. The expert reviews the logs, notices what went wrong, and updates the instructions accordingly.
AI is the power that allows the solution to scale, and expert knowledge is the part that will make you different from your competitors. The AI multiplies the expert's reach. Many of your competitors will probably try just using the AI, and will achieve worse results. Even the ones who start doing the same thing one year later will be at a disadvantage, if you used that time to improve your AI instructions.
For example, imagine a tutoring website with an AI. How could it be better than opening a chat and asking a generic AI to explain a topic? For starters, on the front page, you would see a list of topics. For example, you choose math, and you see a list of math topics that are taught at school, arranged by years. (By "a list of topics" I mean something like a grid of colorful icons. By the way, those icons can also be AI generated, but approved by a human.) For each topic, an expert would specify the set of things that need to be mentioned, common misconceptions that need to be checked, etc. The AI would do the dialog with the student. (For example, if the topic is quadratic equations, the expert would specify that you need to solve an equation with two solutions, an equation with one solution, and an equation with no solutions. Or that the AI should ask whether you know about complex numbers, verify whether you actually do, and depending on that maybe mention that "no solutions" actually means two complex solutions.)
Or, imagine a tool that helps wannabe authors create stories. How could it be better than a generic "hey AI, make me a story about this and that"? For example, you could have a workflow where the AI asks about the size of
Yep, I basically agree with those being the cruxes!
On how much of the goal reasoning is happening out loud: Nowadays, I think about a lot of AI safety research as being aimed at an AI Control scenario where we are closely supervising what models are thinking about in CoT, and thus malign goal seeking must either happen in individual forward passes (and be translated to look harmless most of the time) or in an obfuscated way in CoT (or from rogue deployments outside the control scheme).
Studying the intrinsic goals of the models in naturalistic settings would also
I really appreciate the thoughtful replies and feedback—I also didn't read any of your comments as rude or mean! I'd like to clarify a few points about our approach and its relevance:
We expect that good CoT reasoning helps current models better pursue and achieve goals. In the future, all of the reasoning needed to do this effectively might happen in the forward pass. However, we think it's likely that AGI could be achieved with models similar to current SOTA models, where sophisticated reasoning for goal pursuit will still need to happen in CoT. Even if this bet is wrong and models can do this reasoning entirely in the forward pass, our evals should still reveal goal-directed behavior because we assess actions the model takes that are instrumental to achieving the goal (not just the sensibility of the reasoning traces).
Regarding AI control scenarios, we understand that with good oversight methods, deceptive goal-pursuit would need to happen either in the forward pass or through obfuscated reasoning in CoT. While our scoring of reasoning traces may become less relevant as models improve at internal reasoning, our other metrics focus on how well the model carries out steps and decides on instrumental actions that lead to goal achievement. We believe this serves as a good proxy for goal-directedness, even if we can't directly observe how intrinsic goals arise or manifest from the training process.
A potential risk scenario we're considering is one where a highly capable model is developed that doesn't show obvious signs of misalignment in standard tests (e.g. typical use or general reasoning benchmarks) but has strong goal-pursuit capabilities. Without evaluations that specifically probe goal-directedness, such a model could potentially develop deceptive goals while becoming adept at hiding them. Our evals aim to surface whether models have the ability to perform the necessary reasoning to competently pursue goals, by measuring their capability to carry out the
Some quick thoughts (only skimmed the post, writing quickly), as you asked for feedback:
It looks like the main thing you're testing is some variant of "when prompted to do goal directed behavior, how effective is the model at satisfying the goal?" That's a reasonable thing to investigate, but I'm not sure it would be near the top of the list of "empirical research on goal-directed-ness that I want to see". I'm probably mainly interested in the deceptive alignment motivation, read the rest of this comment as focusing on that.
Thank you for the detailed feedback, I found this very helpful and not at all rude or mean!
I suspect there are a few key disagreements between us that make me more optimistic about this project setup than you. I'd be curious about whether you agree on these points being important cruxes:
* Though I agree that our work primarily belongs within the model organisms paradigm, I disagree that it's only useful as a means to study in-forward-pass goal-directedness. I think there's a considerable chance that the Translucent Thoughts hypotheses are true and AGI will be reached with an architecture very similar to current transformers, which I wouldn't expect to be able to do dangerous goal-directed reasoning within a single forward pass due to the limited number of serial reasoning steps. If this is the case, CoT-based goal-directedness seems relevant to study for its own sake. We're still studying an artificial scenario in the sense that we're giving the models the goal ourselves, while in the deceptive alignment scenarios, the model would have to develop the goal on its own, but I don't think we necessarily have to say anything about in-forward-pass goal-directedness for our work to be applicable to realistic threat models.
* Studying the intrinsic goals of the models in naturalistic settings would also be interesting, but I expect this to be hard, and also less relevant to the threat models we're worried about. I'm mainly concerned about models becoming more goal-directed with scale and someone figuring out long-horizon RL training on top of that, leading to models that are qualitatively different from current ones. To address this, I'd like to have a benchmark that decomposes goal-directedness into several relevant abilities and provides an accurate score for these abilities across model generations. This would help us better track how models become more goal-directed as they scale, if at all. Additionally, if a new model scores close to humans on many of these abil
What's the evidence that this document is real / written by Anthropic?
This sentence seems particularly concerning:
We believe the first two issues can be addressed by focusing on deterrence rather than pre-harm enforcement: instead of deciding what measures companies should take to prevent catastrophes (which are still hypothetical and where the ecosystem is still iterating to determine best practices), focus the bill on holding companies responsible for causing actual catastrophes.
Axios first reported on the letter, quoting from it but not sharing it directly:
https://www.axios.com/2024/07/25/exclusive-anthropic-weighs-in-on-california-ai-bill
The public link is from the San Francisco Chronicle, which is also visible in the metadata on the page citing the letter as “Contributed by San Francisco Chronicle (Hearst Newspapers)”.
https://www.sfchronicle.com/tech/article/wiener-defends-ai-bill-tech-industry-criticism-19596494.php
RobertM
I don't know the full chain of provenance for the document, given how I received it (linked by someone in a Slack server), but I don't have any specific reason to think it's fake. Seems like a lot of effort to go through for not much obvious gain. But it does seem worth keeping that hypothesis in mind, or similar (i.e. it is Anthropic's letter but it was modified by 3rd parties before being published), absent an explicit confirmation or denial.
Nice work, these seem like interesting and useful results!
High level question/comment which might be totally off: one benefit of having a single, large, SAE neuron space that each token gets projected into is that features don't get in each other's way, except insofar as you're imposing sparsity. Like, your "I'm inside a parenthetical" and your "I'm attempting a coup" features will both activate in the SAE hidden layer, as long as they're in the top k features (for some sparsity). But introducing switch SAEs breaks that: if these two features are in ... (read more)
Thanks for your comment! I believe your concern was echoed by Lee and Arthur in their comments and is completely valid. This work is primarily a proof-of-concept that we can successfully scale SAEs by directly applying MoE, but I suspect that we will need to make tweaks to the architecture.
Leaving Dangling Questions in your Critique is Bad Faith
Note: I’m trying to explain an argumentative move that I find annoying and sometimes make myself; this explanation isn’t very good, unfortunately.
Example
Them: This effective altruism thing seems really fraught. How can you even compare two interventions that are so different from one another?
Explanation of Example
I think the way the speaker poses the above question is not as a stepping stone for actually answering the question, it’s simply as a way to cast doubt on effective altruists. My ... (read more)
I think it’s worth asking why people use dangling questions.
In a fun, friendly debate setting, dangling questions can be a positive contribution: they give the other party an opportunity to demonstrate competence and wit with an effective rejoinder.
In a potentially litigious setting, framing critiques as questions (or opinions), rather than as statements of fact, protects you from being convicted of libel.
There are situations where it’s suspicious that a piece of information is missing or not easily accessible, and asking a pointed dangling question seems appropriate to me in these contexts. For certain types of questions, providing answers is assigned to a particular social role, and asking a dangling question can be done to challenge their competence or integrity. If the question-asker answered their own question, it would not provide the truly desired information, which is whether the party being asked is able to supply it convincingly.
Sometimes, asking dangling questions is useful in its own right for signaling the confidence to criticize or probing a situation to see if it’s safe to be critical. Asking certain types of questions can also signal one’s identity, and this can be a way of providing information (“I am a critic of Effective Altruism, as you can see by the fact that I’m asking dangling questions about whether it’s possible to compare interventions on effectiveness”).
In general, I think it’s interesting to consider information exchange as a form of transaction, and to ask whether a norm is having a net benefit in terms of lowering transaction costs. IMO, discourse around the impact of rhetoric (like this thread) is beneficial on net. It creates a perception that people are trying to be a higher-trust community and gets people thinking about the impact of their language on other people.
On the other hand, I think actually refereeing rhetoric (ie complaining about the rhetoric rather than the substance in an actual debate context) is sometimes qu
I agree that repeated training will change the picture somewhat. One thing I find quite nice about the linked Epoch paper is that the range of tokens is an order of magnitude, and even though many people have ideas for getting more data (common things I hear include "use private platform data like messaging apps"), most of these don't change the picture because they don't move things more than an order of magnitude, and the scaling trends want more orders of magnitude, not merely 2x.
Repeated data is the type of thing that plausibly adds an order of magnitude or maybe more.
The point is that you need to get quantitative in these estimates to claim that data is running out, since it has to run out compared to available compute, not merely on its own. And the repeated data argument seems by itself sufficient to show that it doesn't in fact run out in this sense.
Data still seems to be running out for overtrained models, which is a major concern for LLM labs, so from their point of view there is indeed a salient data wall that's very soon going to become a problem. There are rumors of synthetic data (which often ambiguously gesture at post-training results while discussing the pre-training data wall), but no published research for how something like that improves the situation with pre-training over using repeated data.
I sometimes want to point at a concept that I've started calling The Scaling Picture. While it's been discussed at length (e.g., here, here, here), I wanted to give a shot at writing a short version:
The picture:
We see improving AI capabilities as we scale up compute, projecting the last few years of progress in LLMs forward might give us AGI (transformative economic/political/etc. impact similar to the industrial revolution; AI that is roughly human-level or better on almost all intellectual tasks) later this decade (note: the picture is not about specific
Data is running out for making overtrained models, not Chinchilla-optimal models, because you can repeat data (there's also a recent hour-long presentation by one of the authors). This systematic study was published only in May 2023, though the Galactica paper from Nov 2022 also has a result to this effect (see Figure 6). The preceding popular wisdom was that you shouldn't repeat data for language models, so cached thoughts that don't take this result into account are still plentiful, and also it doesn't sufficiently rescue highly overtrained models, so the underlying concern still has some merit.
As you repeat data more and more, the Chinchilla multiplier of data/parameters (data in tokens divided by number of active parameters for an optimal use of given compute) gradually increases from 20 to 60 (see the data-constrained efficient frontier curve in Figure 5 that tilts lower on the parameters/data plot, deviating from the Chinchilla efficient frontier line for data without repetition). You can repeat data essentially without penalty about 4 times, efficiently 16 times, and with any use at all 60 times (at some point even increasing parameters while keeping data unchanged starts decreasing rather than increasing performance). This gives a use for up to 100x more compute, compared to Chinchilla optimal use of data that is not repeated, while retaining some efficiency (at 16x repetition of data). Or up to 1200x more compute for the marginally useful 60x repetition of data.
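The headline figures above can be checked with the standard approximation C ≈ 6ND (training FLOPs ≈ 6 × active parameters × tokens); the tokens-per-parameter ratios below (20 without repetition, rising toward 60 at heavy repetition) are the ones quoted above, applied to a 50T-token unique dataset:

```python
# Back-of-the-envelope check of the repeated-data compute figures,
# using the standard approximation: training FLOPs C ~= 6 * N * D.
# Tokens-per-parameter ratios (20 -> 60) are taken from the comment above.

def repeated_data_compute(unique_tokens, repeats, tokens_per_param):
    total_tokens = unique_tokens * repeats
    params = total_tokens / tokens_per_param  # active parameters N
    return 6 * params * total_tokens          # C ~= 6 * N * D

base = repeated_data_compute(50e12, repeats=1, tokens_per_param=20)
maxed = repeated_data_compute(50e12, repeats=60, tokens_per_param=60)

print(f"{base:.1e}")           # ~7.5e26 FLOPs (the "about 8e26" figure)
print(f"{maxed:.1e}")          # ~9.0e29 FLOPs at the marginally useful extreme
print(f"{maxed / base:.0f}x")  # ~1200x more compute than without repetition
```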
The datasets you currently see at 15-30T tokens scale are still highly filtered compared to available raw data (see Figure 4). The scale feasible within a few years is about 2e28-1e29 FLOPs (accounting for hypothetical hardware improvement and larger datacenters of early 2030s; this is physical, not effective compute). Chinchilla optimal compute for a 50T token dataset is about 8e26 FLOPs, which turns into 8e28 FLOPs with 16x repetition of data, up to 9e29 FLOPs for the barely useful 60x repetit
Vladimir_Nesov
There was about 5x increase since GPT-3 for dense transformers (see Figure 4) and then there's MoE, so assuming GPT-3 is not much better than the 2017 baseline after anyone seriously bothered to optimize, it's more like 30% per year, though plausibly slower recently.
The relevant Epoch paper says point estimate for compute efficiency doubling is 8-9 months (Section 3.1, Appendix G), about 2.5x/year. Though I can't make sense of their methodology, which aims to compare the incomparable. In particular, what good is comparing even transformers without following the Chinchilla protocol (finding minima on isoFLOP plots of training runs with individually optimal learning rates, not continued pre-training with suboptimal learning rates at many points). Not to mention non-transformers where the scaling laws won't match and so the results of comparison change as we vary the scale, and also many older algorithms probably won't scale to arbitrary compute at all.
(With JavaScript mostly disabled, the page you linked lists "Compute-efficiency in language models" as 5.1%/year (!!!). After JavaScript is sufficiently enabled, it starts saying "3 ÷/year", with a '÷' character, though "90% confidence interval: 2 times to 6 times" disambiguates it. In other places on the same page there are figures like "2.4 x/year" with the more standard 'x' character for this meaning.)
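For converting between the two conventions (doubling time vs. yearly multiplier), the relationship is a factor of 2^(12/t) per year for a doubling time of t months. A quick check against the figures above:

```python
import math

def doubling_months_to_yearly_factor(t_months):
    # A doubling every t months compounds to 2**(12/t) per year.
    return 2 ** (12 / t_months)

def yearly_factor_to_doubling_months(factor):
    # Inverse: months per doubling for a given yearly multiplier.
    return 12 * math.log(2) / math.log(factor)

print(doubling_months_to_yearly_factor(8.5))  # ~2.66x/year for an 8.5-month doubling
print(yearly_factor_to_doubling_months(2.5))  # ~9.1 months for 2.5x/year
```

This matches the quoted "8-9 months, about 2.5x/year" pairing.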
For jailbreaking you are trying to learn the policy "Always imitate/generate-from a harmless assistant", here you are trying to learn "Always imitate safe human". In both, your model has some prior for outputting harmful next tokens, the context provides an update toward a higher probability of outputting harmful text (because of seeing previous examples of the assistant doing so, or because the previous generations came from an AI). And in both cases we would like some trai... (read more)
Cool! I'm not very familiar with the paper so I don't have direct feedback on the content — seems good. But I do think I would have preferred a section at the end with your commentary / critiques of the paper, also that's potentially a good place to try and connect the paper to ideas in AI safety.
It looks like the example you gave is pretty explicitly using “compute” rather than “effective compute”. The point of having the “effective” part is to take into account non compute progress, such as using more optimal N/D ratios. I think in your example, the first two models would be at the same effective compute level, based on us predicting the same performance.
That said, I haven’t seen any detailed descriptions of how Anthropic is actually measuring/calculating effective compute (iirc they link to a couple papers and the main theme is that you can use training CE loss as a predictor).
This is a reasonable formulation of what "effective compute" could be defined to mean, but is it actually used in this sense in practice, and who uses it like that? Is it plausible it was used when Anthropic was making the claim that "While Claude 3.5 Sonnet represents an improvement in capabilities over our previously released Opus model, it does not trigger the 4x effective compute threshold" that compares a more Chinchilla optimal model to a more overtrained model?
It's an interesting thought, I didn't consider that this sense of "effective compute" could be the intended meaning. I was more thinking about having a compute multiplier measured from perplexity/FLOPs plots of optimal training runs that compare architectures, like in Figure 4 of the Mamba paper, where we can see that Transformer++ (RMSNorm/SwiGLU/etc.) needs about 5 times less compute (2 times less data) than vanilla Transformer to get the same perplexity, so you just multiply physical compute by 5 to find effective compute of Transformer++ with respect to vanilla Transformer. (With this sense of "effective compute", my argument in the grandparent comment remains the same for effective compute as it is for physical compute.)
In particular, this multiplication still makes sense in order to estimate performance for overtrained models with novel architectures, which is why it's not obvious that it won't normally be used like this. So there are two different possible ways of formulating effective compute for overtrained models, which are both useful for different purposes. I was under the impression that simply multiplying by a compute multiplier measured by comparing performance of Chinchilla optimal models of different architectures is how effective compute is usually formulated even for overtrained models, and that the meaning of the other possible formulation that you've pointed out is usually discussed in terms of perplexity or more explicitly Chinchilla optimal models with equivalent performance,
Claude 3.5 Sonnet solves 64% of problems on an internal agentic coding evaluation, compared to 38% for Claude 3 Opus. Our evaluation tests a model’s ability to understand an open source codebase and implement a pull request, such as a bug fix or new feature, given a natural language description of the desired improvement.
...
While Claude 3.5 Sonnet represents an improvement in capabilities over our previously released Opus model, it does not trigger the 4x effective compute threshold at which we will run the full evaluation protocol described in our Respons
I think more to the point is that when deviating from Chinchilla optimality, measuring effective compute becomes misleading, you can span larger increases in effective compute for Chinchilla optimal models by doing a detour through overtrained models. And given the price difference, Claude 3.5 Sonnet is likely more overtrained than Claude 3 Opus.
Let's say we start with a Chinchilla optimal model with N active parameters that trains for 20N tokens using 120N² FLOPs of compute. We can then train another model with N/3 active parameters for 180N tokens using 360N² FLOPs of compute, and get approximately the same performance as with the previous model, but we've now made use of 3 times more compute, below the RSP's 4x threshold. Then, we train the next Chinchilla optimal model with 3N active parameters for 60N tokens using 1080N² FLOPs of compute, an increase by another 3 times, also below the 4x threshold. But only this second step to the new Chinchilla optimal model increases capabilities, and it uses 9x more compute than the previous Chinchilla optimal model.
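The detour arithmetic is easy to verify numerically with the standard approximation C = 6ND (N in arbitrary units, since only the ratios matter):

```python
# Sketch of the effective-compute "detour" described above.
# Training compute via the standard approximation C = 6 * N * D.

def train_compute(params, tokens):
    return 6 * params * tokens

N = 1.0  # arbitrary units of active parameters

c_opt1 = train_compute(N, 20 * N)       # Chinchilla optimal: 120 N^2
c_over = train_compute(N / 3, 180 * N)  # overtrained detour: 360 N^2
c_opt2 = train_compute(3 * N, 60 * N)   # next Chinchilla optimal: 1080 N^2

print(c_over / c_opt1)  # 3.0 -> below a 4x threshold
print(c_opt2 / c_over)  # 3.0 -> below a 4x threshold
print(c_opt2 / c_opt1)  # 9.0 -> but 9x between the two optimal models
```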
Can you say more about why you would want this to exist? Is it just that "do auto-interpretability well" is a close proxy for "model could be used to help with safety research"? Or are you also thinking about deception / sandbagging, or other considerations.
Nice! Do you have a sense of the total development (and run-time) cost of your solution? "Actually getting to 50% with this main idea took me about 6 days of work." I'm interested in the person-hours and API calls cost of this.
Hm, can you explain what you mean? My initial reaction is that AI oversight doesn't actually look a ton like this position of the interior where defenders must defend every conceivable attack whereas attackers need only find one successful strategy. A large chunk of why I think these are disanalogous is that getting caught is actually pretty bad for AIs — see here.
Not sure I love this analogy — moving to NYC doesn't seem like that big of a deal — but I do think it's pretty messed up to be imposing huge social / technological / societal changes on 8 billion of your peers. I expect most of the people building AGI have not really grasped the ethical magnitude of doing this — I think I sort of have, but also I don't build AGI.
Note on something from the superalignment section of Leopold Aschenbrenner's recent blog posts:
Evaluation is easier than generation. We get some of the way “for free,” because it’s easier for us to evaluate outputs (especially for egregious misbehaviors) than it is to generate them ourselves. For example, it takes me months or years of hard work to write a paper, but only a couple hours to tell if a paper someone has written is any good (though perhaps longer to catch fraud). We’ll have teams of expert humans spend a lot of time evaluating every RLHF examp
AIs that do ARA will need to be operating at the fringes of human society, constantly fighting off the mitigations that humans are using to try to detect them and shut them down
Why do you think this? What is the general story you're expecting?
I think it's plausible that humanity takes a very cautious response to AI autonomy, including hunting and shutting down all autonomous AIs — but I don't think the arguments I'm considering justify more than like 70% confidence (I think I'm somewhere around 60%). Some arguments pointing toward "maybe we won't res... (read more)
I appreciate this post. Emphasizing a couple things and providing some other commentary/questions on the paper (as there doesn't seem to be a better top level post for it) (I have not read the paper deeply and could be missing things):
I find the Twitter vote brigading to be annoying and slightly bad for collective epistemics. I do not think this paper was particularly good, and it did not warrant the attention it got. (The main flaws IMO are a lack of (empirical) comparison to other methods — except a brief interlude in the appendix; and lack of any benchmarki
I don’t have strong takes, but you asked for feedback.
It seems nontrivial that the “value proposition” of collaborating with this brain-chunk is actually net positive. E.g., if it involved giving 10% of the universe to humanity, that’s a big deal. Though I can definitely imagine where taking such a trade is good.
It would likely help to devise more clarity about why the brain-chunk provides value. Is it because humanity has managed to coordinate to get a vast majority of high performance compute under the control of a single entity and access to compute is ... (read more)
Thank you, I think you pointed out some pretty significant oversights in the plan.
I was hoping that the system only needed to provide value during the period where an AI is expanding towards a superintelligent singleton, and we only really needed to live through that transition. But you're making me realize that even if we could give it a positive-sum trade up to that point, it would rationally defect afterwards unless we had changed its goals on a deep level. And like you say, that sort of requires that the system can solve alignment as it goes. I'd been thinking that by shifting its trajectory we could permanently alter its behavior even if we're not solving alignment. I still think that it is possible that we could do that, but probably not in ways that matter for our survival, and probably not in ways that would be easy to predict (e.g. by shifting AI to build X before Y, something about building X causes it to gain novel understanding which it then leverages. Probably not very practically useful since we don't know those in advance.)
I have a rough intuition that the ability to survive the transition to superintelligence still gives humanity more of a chance. In the sense that I expect the AI to be much more heavily resource constrained early in its timeline, and gaining compounding advantages as early as possible is much more advantageous; whereas post-superintelligence the value of any resource may be more incremental. But if that's the state of things, we still require a continuous positive-sum relationship without alignment, which feels likely-impossible to me.
I appreciate this comment, especially #3, for voicing some of why this post hasn't clicked for me.
The interesting hypotheses/questions seem to rarely have strong evidence. But I guess this is partially a selection effect where questions become less interesting by virtue of me being able to get strong evidence about them, no use dwelling on the things I'm highly confident about. Some example hypotheses that I would like to get evidence about but which seem unlikely to have strong evidence: Sam Altman is a highly deceptive individual, far more deceptiv... (read more)
Just chiming in that I appreciate this post, and my independent impressions of reading the FSF align with Zach's conclusions: weak and unambitious.
A couple additional notes:
The thresholds feel high — 6/7 of the CCLs feel like the capabilities would be a Really Big Deal in prosaic terms, and ~4 feel like a big deal for x-risk. But you can't say whether the thresholds are "too high" without corresponding safety mitigations, which this document doesn't have. (Zach)
These also seemed pretty high to me, which is concerning given that they are "Level 1". Th... (read more)
Sam Altman and OpenAI have both said they are aiming for incremental releases/deployment for the primary purpose of allowing society to prepare and adapt. Opposed to, say, dropping large capabilities jumps out of the blue which surprise people.
I think "They believe incremental release is safer because it promotes societal preparation" should certainly be in the hypothesis space for the reasons behind these actions, along with scaling slowing and frog-boiling. My guess is that it is more likely than both of those reasons (they have stated it as their ... (read more)
Yeah, "they're following their stated release strategy for the reasons they said motivated that strategy" also seems likely to share some responsibility. (I might not think those reasons justify that release strategy, but that's a different argument.)
jmh
I wonder if that is actually a sound view though. I just started reading LikeWar (interesting and seems correct/on target so far, but I'm really just starting it). Given the subject area (the impact, reaction, and use of social media and networking technologies, and the general social results), it seems like society generally is not yet prepared and adapted for that innovation. If all the fears about AI are even close to getting things right, I suspect "allowing society to prepare and adapt" suggests putting everything on hold, freezing in place, for at least a decade and probably longer.
Altman's and OpenAI's intentions might be towards that stated goal, but I think they are basing that approach on how "the smartest people in the room" react to AI, not on the general public or the most opportunistic people in the room.
This might be a dumb question(s), I'm struggling to focus today and my linear algebra is rusty.
Is the observation that 'you can do feature ablation via weight orthogonalization' a new one?
It seems to me like this (feature ablation via weight orthogonalization) is a pretty powerful tool which could be applied to any linearly represented feature. It could be useful for modulating those features, and as such is another way to do ablations to validate a feature (part of the 'how do we know we're not fooling ourselves about our results' toolkit). Does this seem right? Or does it not actually add much?
1. Not sure if it's new, although I haven't seen it used like this before. I think of the weight orthogonalization as just a nice trick to implement the ablation directly in the weights. It's mathematically equivalent, and the conceptual leap from inference-time ablation to weight orthogonalization is not a big one.
2. I think it's a good tool for analysis of features. There are some examples of this in sections 5 and 6 of Belrose et al. 2023 - they do concept erasure for the concept "gender," and for the concept "part-of-speech tag."
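The equivalence in (1) is easy to sketch numerically (toy matrices here; `r` is a unit-norm feature direction, and `W` stands in for any weight matrix that writes to the residual stream):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W = rng.normal(size=(d, d))  # toy stand-in for a writes-to-residual-stream matrix
r = rng.normal(size=d)
r /= np.linalg.norm(r)       # unit-norm feature direction

# Inference-time ablation: remove the r-component from the output activation.
x = rng.normal(size=d)
out_ablate = W @ x - (r @ (W @ x)) * r

# Weight orthogonalization: bake the same projection into the weights.
W_orth = W - np.outer(r, r) @ W
out_orth = W_orth @ x

assert np.allclose(out_ablate, out_orth)  # mathematically equivalent
```

The projection `(I - r rᵀ) W` commutes with applying the matrix, which is why the two views are the same operation.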
My rough mental model is as follows (I don't really know if it's right, but it's how I'm thinking about things):
* Some features seem continuous, and for these features steering in the positive and negative directions work well.
* For example, the "sentiment" direction. Sentiment can sort of take on continuous values, e.g. -4 (very bad), -1 (slightly bad), 3 (good), 7 (extremely good). Steering in both directions works well - steering in the negative direction causes negative sentiment behavior, and in the positive causes positive sentiment behavior.
* Some features seem binary, and for these feature steering in the positive direction makes sense (turn the feature on), but ablation makes more sense than negative steering (turn the feature off).
* For example, the refusal direction, as discussed in this post.
So yeah, when studying a new direction/feature, I think ablation should definitely be one of the things to try.
Thinking about AI training runs scaling to the $100b/1T range. It seems really hard to do this as an independent AGI company (not owned by tech giants, governments, etc.). It seems difficult to raise that much money, especially if you're not bringing in substantial revenue or it's not predicted that you'll be making a bunch of money in the near future.
What happens to OpenAI if GPT-5 or the ~5b training run isn't much better than GPT-4? Who would be willing to invest the money to continue? It seems like OpenAI either dissolves or gets acquired. Were A... (read more)
Um, looking at the scaling curves and seeing diminishing returns? I think this pattern is very clear for metrics like general text prediction (cross-entropy loss on large texts), less clear for standard capability benchmarks, and to-be-determined for complex tasks which may be economically valuable.
Diminishing returns in loss are not diminishing returns in capabilities. And benchmarks tend to saturate, so diminishing returns are baked in if you look at those.
I am not saying that there aren't diminishing returns to scale, but I just haven't seen anything definitive yet.
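One way to see why loss-space diminishing returns are baked in: under standard scaling laws, cross-entropy loss follows a power law in compute, so equal multiplicative jumps in compute buy shrinking absolute loss reductions by construction. A toy illustration (the constants are made up, not fitted to any real model):

```python
# Toy power-law loss curve: L(C) = E + A * C**(-alpha).
# E, A, alpha below are illustrative only, not fitted values.
E, A, alpha = 1.7, 10.0, 0.05  # hypothetical irreducible loss, scale, exponent

def loss(compute):
    return E + A * compute ** (-alpha)

for c in [1e22, 1e24, 1e26]:
    print(f"{c:.0e}: {loss(c):.3f}")
# Each successive 100x of compute shaves off a smaller absolute amount
# of loss, which is what "diminishing returns on loss" looks like by
# construction, whatever is happening to downstream capabilities.
```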
Yeah, these developments benefit close-sourced actors too. I think my wording was not precise, and I'll edit it. This argument about algorithmic improvement is an argument that we will have powerful open source models (and powerful closed-source models), not that the gap between these will necessarily shrink. I think both the gap and the absolute level of capabilities which are open-source are important facts to be modeling. And this argument is mainly about the latter.
Yeah, I think we should expect much more powerful open source AIs than we have now. I've been working on a blog post about this, maybe I'll get it out soon. Here are what seem like the dominant arguments to me:
Scaling curves show strongly diminishing returns to $ spend: A $100m model might not be that far behind a $1b model, performance wise.
There are numerous (maybe 7) actors in the open source world who are at least moderately competent and want to open source powerful models. There is a niche in the market for powerful open source models, an
I agree with the premise, but not the conclusion of your last point. Any open-source development that significantly lowers resource requirements can also be used by closed models to just increase their model/training size for the same cost, thus keeping the gap.
The implication of ICL being implicit BI is that the model is locating concepts it already learned in its training data, so ICL would not be a fundamentally new form of learning.
I'm not sure I follow this. Are you saying that, if ICL is BI, then a model could not learn a fundamentally new concept in context? Can some of the hypotheses not be unknown — e.g., the model's no-context priors are that it's doing wikipedia prediction (50%), chat bot roleplay (40%), or some unknown role (10%). And ICL seems like it could increase the weight on the unknow... (read more)
I think I mean to say this would imply ICL could not be a new form of learning. And yes, it seems more likely that there could be at least some new knowledge getting generated, one way or another. BI implying all tasks have been previously seen feels extreme, and less likely. I've adjusted my wording a bit now.