Doesn't SL already learn a full language in which complex thoughts can be expressed? I agree it was brutal (it required >1e24 FLOP), but still.
I think the RL case might be more analogous to the translation pair case than the "just the encoded text" case. How does that alter the bottom line?
...However, this strongly limits the space of possible aggregated agents. Imagine two EUMs, Alice and Bob, whose utilities are each linear in how much cake they have. Suppose they’re trying to form a new EUM whose utility function is a weighted average of their utility functions. Then they’d only have three options:
- Form an EUM which would give Alice all the cakes (because it weights Alice’s utility higher than Bob’s)
- Form an EUM which would give Bob all the cakes (because it weights Bob’s utility higher than Alice’s)
- Form an EUM which is totally indifferent about how the cakes are allocated (because it weights their utilities equally)
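To spell out why these are the only options, here is a minimal sketch (my notation, not from the original: C is the total amount of cake, a is Alice's share, and w is Alice's weight in the aggregated utility function):

$$U(a) = w \cdot a + (1 - w)(C - a) = (2w - 1)\,a + (1 - w)\,C$$

This is linear in a, so it is maximized by giving Alice everything (a = C) when w > 1/2, by giving Bob everything (a = 0) when w < 1/2, and it is constant in a (total indifference) when w = 1/2.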
I'd guess that the benchmarks which METR uses have enough label noise and other issues (e.g. specification ambiguity) that measuring >=95% reliability isn't meaningful. 80% probably is meaningful.
When analyzing the high reliability regime, I think you'd want to get a ceiling on performance by baselining with human domain experts who are instructed to focus on reliability. (E.g. something like: "continue testing etc until you're very confident the task is completed successfully".) And then you'd compare this human ceiling to AI performance (for AIs whic...
It looks like the images aren't showing up on LW.
Number of cortical neurons != brain size. Orcas have ~2x the number of cortical neurons, but much larger brains. Assuming brain weight is proportional to volume, with human brains being typically 1.2-1.4kg, and orca brains being typically 5.4-6.8kg, orca brains are actually like 6.1/1.3=4.7 times larger than human brains.
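A quick check of that ratio from the quoted mass ranges (a sketch; the midpoints and range endpoints are just my way of summarizing the figures above):

```python
# Brain-mass figures quoted above (kg).
human_lo, human_hi = 1.2, 1.4
orca_lo, orca_hi = 5.4, 6.8

human_mid = (human_lo + human_hi) / 2   # 1.3
orca_mid = (orca_lo + orca_hi) / 2      # 6.1

print(orca_mid / human_mid)                    # ~4.7x larger by mass
print(orca_lo / human_hi, orca_hi / human_lo)  # ~3.9x to ~5.7x across the ranges
```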
I think cortical neuron count is a better proxy than brain size, and I expect that the relation between cortical neurons and brain size differs substantially between species. (I expect more similarity within a species.)
...My guess is that neuron
Some quick (and relatively minor) notes:
Yes, I was intending my comment to refer to just code at Anthropic. (Otherwise I would talk much more about serious integration lags and lack of compute.)
This makes sense as a crux for the claim "we need philosophical competence to align unboundedly intelligent superintelligences." But, it doesn't make sense for the claim "we need philosophical competence to align general, open-ended intelligence."
I was thinking of a slightly broader claim: "we need extreme philosophical competence". If I thought we had to use human labor to align wildly superhuman AIs, I would put much more weight on "extreme philosophical competence is needed". I agree that "we need philosophical competence to align any general, openend...
Orcas have about 43 billion cortical neurons - humans have about 21 billion. The orca cortex has 6 times the area of the human cortex, though the neuron density is about 3 times lower.
[...]
My uncertain guess is that, within mammalian brains, scaling matters a lot more for individual intelligence,
This post seems to assume that a 2x increase in brain size is a huge difference (claiming this could plausibly yield +6SD), but a naive botec doesn't support this.
For humans, brain size and IQ are correlated at ~0.3. Brain size has a standard deviation of roughl...
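For illustration, here is roughly what the naive botec looks like (a sketch: the brain-size standard deviation below is an assumption on my part since the figure above is cut off, and linearly extrapolating a within-population correlation this far is exactly what makes it naive):

```python
# Hypothetical numbers for illustration only.
correlation = 0.3      # brain size vs IQ correlation (from above)
sd_fraction = 0.10     # ASSUMED: brain-size SD as a fraction of the mean (ballpark)

size_increase_in_sds = 1.0 / sd_fraction          # a 2x (+100%) size increase, in SDs
implied_iq_gain_in_sds = correlation * size_increase_in_sds
print(implied_iq_gain_in_sds)   # ~3 SD under these assumptions, well short of +6 SD
```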
The "extreme philosophical competence" hypothesis is that you need such competence to achieve "seriously aligned" in this sense. It sounds like you disagree, but I don't know why since your reasoning just sidesteps the problem.
Yes, my reasoning is definitely part, but not all of the argument. Like the thing I said is a sufficient crux for me. (If I thought we had to directly use human labor to align AIs which were qualitatively wildly superhuman in general I would put much more weight on "extreme philosophical competence".)
...Looking over the comments of
Yep, just the obvious. (I'd say "much less bought in" than "isn't bought in", but whatever.)
I don't really have dots I'm trying to connect here, but this feels more central to me than what you discuss. Like, I think "alignment might be really, really hard" (which you focus on) is less of the crux than "is misalignment that likely to be a serious problem at all?" in explaining how Anthropic operates. Another way to put this is that I think "is misalignment the biggest problem" is maybe more of the crux than "is misalignment going to be really, really hard to resolve in some worlds". I see why you went straight to your belief though.
IMO actively torch the "long pause" worlds
Not sure how interesting this is to discuss, but I don't think I agree with this. Stuff they're doing does seem harmful to worlds where you need a long pause, but it feels like, at the very least, Anthropic is a small fraction of the torching, right? Like, if you think Anthropic is making this less likely, surely they are a small fraction of the people pushing in this direction, such that they aren't making this that much worse (and can probably still pivot later given what they've said so far).
I don't expect 90% of code in 6 months and more confidently don't expect "almost all" in 12 months for a reasonable interpretation of almost all. However, I think this prediction is also weaker than it might seem, see my comment here.
Dario Amodei says AI will be writing 90% of the code in 6 months and almost all the code in 12 months.
I think it's somewhat unclear how big of a deal this is. In particular, situations where AIs write 90% of lines of code, but are very far (in time, effective compute, and qualitative capabilities) from being able to automate research engineer jobs seem very plausible to me. Perhaps Dario means something a bit stronger than "90% of lines of code".
It's pretty easy to get to 25% of lines of code written by LLMs with very weak models, e.g., Google claims to...
My guess is that the parts of the core leadership of Anthropic which are thinking actively about misalignment risks (in particular, Dario and Jared) think that misalignment risk is like ~5x smaller than I think it is while also thinking that risks from totalitarian regimes are like 2x worse than I think they are. I think the typical views of opinionated employees on the alignment science team are closer to my views than to the views of leadership. I think this explains a lot about how Anthropic operates.
I share similar concerns that Anthropic doesn't seem ...
I think I disagree some with this change. Now I'd say something like "We think the control line-of-defense should mostly focus on the time before we have enough evidence to relatively clearly demonstrate the AI is consistently acting egregiously badly. However, the regime where we deploy models despite having pretty strong empirical evidence that that model is scheming (from the perspective of people like us), is not out of scope."
You could have the view that open weights AGI is too costly on takeover risk and escape is bad, but we'll hopefully have some pre-AGI AIs which do strange misaligned behaviors that don't really get them much/any influence/power. If this is the view, then it really feels to me like preventing escape/rogue internal deployment is pretty useful.
Related question: are you in favor of making AGI open weights? By AGI, I mean AIs which effectively operate autonomously and can autonomously acquire money/power. This includes AIs capable enough to automate whole fields of R&D (but not much more capable than this). I think the case for this being useful on your views feels much stronger than the case for control preventing warning shots. After all, you seemingly mostly thought control was bad due to the chance it would prevent escape or incidents of strange (and not that strategic) behavior. Nai...
I do understand this line of reasoning, but yes, my intuition differs. For some sort of a weird case study, consider Sydney. [...] My guess is that none of that would have happened with properly implemented control measures.
Sure, or with properly implemented ~anything related to controlling the AI's behavior. I don't really expect incidents like Sydney in the future, nor do I think Sydney was that helpful in motivating a societal response? So, this doesn't feel like a meaningfully representative example.
...I would say some assumptions go into who the ‘we’
I think something like this is a live concern, though I'm skeptical that control is net negative for this reason.
My baseline guess is that trying to detect AIs doing problematic actions makes it more likely that we get evidence for misalignment that triggers a useful response from various groups. I think it would be a priori somewhat surprising if a better strategy for getting enough evidence for risk to trigger substantial action is to avoid looking for AIs taking problematic actions, so that it isn't mitigated as effectively, so that AIs succeed in large...
After thinking more about it, I think "we haven't seen evidence of scheming once the octopi were very smart" is a bigger update than I was imagining, especially in the case where the octopi weren't communicating with octopese. So, I'm now at ~20% without octopese and about 50% with it.
Yes, I think frontier AI companies are responsible for most of the algorithmic progress. I think it's unclear how much the leading actor benefits from progress made at other AI companies that are slightly behind, and this could make progress substantially slower. (However, it's possible the leading AI company would be able to acquire the GPUs from these other companies.)
My current best guess median is that we'll see 6 OOMs of effective compute in the first year after full automation of AI R&D if this occurs in ~2029 using a 1e29 training run and compute is scaled up by a factor of 3.5x[1] over the course of this year[2]. This is around 5 years of progress at the current rate[3].
How big of a deal is 6 OOMs? I think it's a pretty big deal; I have a draft post discussing how much an OOM gets you (on top of full automation of AI R&D) that I should put out somewhat soon.
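As a rough decomposition of where those 6 OOMs come from and what the "5 years" equivalence implies (a sketch using only the numbers above):

```python
import math

ooms_total = 6                    # effective-compute OOMs in the first year (from above)
compute_scaleup = 3.5             # physical compute scale-up over that year (from above)

ooms_from_compute = math.log10(compute_scaleup)        # ~0.54 OOM from more hardware
ooms_from_algorithms = ooms_total - ooms_from_compute  # ~5.5 OOMs from software/algorithms
print(ooms_from_compute, ooms_from_algorithms)

# If 6 OOMs corresponds to ~5 years of progress at the current rate, the implied
# current rate is about 1.2 OOMs of effective compute per year.
print(ooms_total / 5)
```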
Further, my distribution over this is radically u...
I'm not making a strong claim this makes sense and I think people should mostly think about the AI case directly. I think it's just another intuition pump and we can potentially be more concrete in the octopus case as we know the algorithm. (While in the AI case, we haven't seen an ML algorithm that scales to human level.)
I should note that I'm quite uncertain here and I can easily imagine my views swinging by large amounts.
Amusingly, this is actually not exactly what we were discussing and is mostly off-topic for our discussion.
Recently, @Daniel Kokotajlo and I were talking about the probability that AIs trained using "business as usual RLHF" end up being basically aligned rather than conspiring against us and our tests.[1] One intuition pump we ended up discussing is the prospects of octopus misalignment. Overall, my view is that directly considering the case with AIs (and what various plausible scenarios would look like) is more informative than analogies like this, but analogies like this are still somewhat useful to consider.
So, what do I mean by octopus misalignment? Suppose...
I'm not sure about the details of the concrete proposal, but I agree with the spirit of the proposal.
(In particular, I don't know if I think having the "do you consent" text in this way is a good way to do this given limited will. I also think you want to have a very specific signal of asking for consent that you commit to filtering out except when it is actually being used. This is so the AI isn't worried it is in red teaming etc.)
I certainly agree it isn't clear, just my current best guess.
On some axes, but won't there also be axes where AIs are more difficult than humans? Sycophancy & slop being the most salient. Misalignment issues being another.
Yes, I just meant on net. (Relative to the current ML community and given a similar fraction of resources to spend on AI compute.)
Oh, yeah I meant "perform well according to your metrics" not "behave well" (edited)
I don't think "what is the necessary work for solving alignment" is a frame I really buy. My perspective on alignment is more like:
who realize AGI is only a few years away
I dislike the implied consensus / truth. (I would have said "think" instead of "realize".)
For people (such as myself) who think/realize timelines are likely short, I find it more truth-tracking to use terminology that actually represents my epistemic state (that timelines are likely short) rather than hedging all the time and making it seem like I'm really uncertain.
Under my own lights, I'd be giving bad advice if I were hedging about timelines when giving advice (because the advice wouldn't be tracking the world as it is, it would be tracking a probability distribution I disagree with and thus a probability distribution that leads...
A typical crux is that I think we can increase our chances of "real alignment" using prosaic and relatively unenlightened ML research without any deep understanding.
I both think:
This top level post is part of Josh's argument for (2).
FWIW, I don't think "data-efficient long-horizon RL" (which is sample efficient in an online training sense) implies you can make faithful simulations.
but you do get a machine that pursues the metric that you fine-tuned it to pursue, even out of distribution (with a relatively small amount of data).
I think if the model is scheming it can behave arbitrarily badly in concentrated ways (either in a small number of actions or in a short period of time), but you can make it perform well according to your metrics in the average case using online training.
Something important is that "significantly accelerate alignment research" isn't the same as "making AIs that we're happy to fully defer to". This post is talking about conditions needed for deference and how we might achieve them.
(Some) acceleration doesn't require being fully competitive with humans while deference does.
I think AIs that can autonomously do moderate duration ML tasks (e.g., 1 week tasks), but don't really have any interesting ideas could plausibly speed up safety work by 5-10x if they were cheap and fast enough.
To be clear, I think there are important additional considerations related to the fact that we don't just care about capabilities that aren't covered in that section, though that section is not that far from what I would say if you renamed it to "behavioral tests", including both capabilities and alignment (that is, alignment other than stuff that messes with behavioral tests).
But it isn't a capabilities condition? Maybe I would be happier if you renamed this section.
I think there is an important component of trustworthiness that you don't emphasize enough. It isn't sufficient to just rule out alignment faking, we need the AI to actually try hard to faithfully pursue our interests including on long, confusing, open-ended, and hard to check tasks. You discuss establishing this with behavioral testing, but I don't think this is trivial to establish with behavioral testing. (I happen to think this is pretty doable and easier than ruling out reasons why our tests might be misleading, but this seems nonobvious.)
Perhaps you ...
Note that the capability milestone forecasted in the linked short form is substantially weaker than the notion of transformative AI in the 2020 model. (It was defined as AI with an effect at least as large as the industrial revolution.)
I don't expect this adds many years; for me it adds like ~2 years to my median.
(Note that my median for time from 10x to this milestone is lower than 2 years, but median to Y isn't equal to median to X + median from X to Y.)
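A toy illustration of the "medians don't add" point (a sketch with made-up right-skewed duration distributions, not my actual timeline distributions):

```python
import numpy as np

rng = np.random.default_rng(0)
to_x = rng.exponential(scale=1.0, size=1_000_000)    # hypothetical time to milestone X
x_to_y = rng.exponential(scale=1.0, size=1_000_000)  # hypothetical time from X to Y

print(np.median(to_x) + np.median(x_to_y))  # ~1.39: sum of the two medians
print(np.median(to_x + x_to_y))             # ~1.68: median of the total, noticeably larger
```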
Maybe? At a very high level, I think the weights tend not to have "goals," in the way that the rollouts tend to have goals.
Sure, I meant naturally emerging malign goals to include both "the AI pursues non-myopic objectives" and "these objectives weren't intended and some (potentially small) effort was spent trying to prevent this".
(I think AIs that are automating huge amounts of human labor will be well described as pursuing some objective at least within some small context (e.g. trying to write and test a certain piece of software), but this could be well controlled or sufficiently myopic/narrow that the AI doesn't focus on steering the general future situation including its own weights.)
CoT is way more interpretable than I expected, which bumped me up, so if that became uninterpretable naturally that's a big bump down. I think people kinda overstate how likely this is to happen naturally though.
Presumably you'd update toward pessimism a bunch if reasoning in latent vectors aka neuralese was used for the smartest models (instead of natural language CoT) and it looked like this would be a persistent change in architecture?
(I expect that (at least when neuralese is first introduced) you'll have both latent reasoning and natural language ...
Some things are true simply because they are true and in general there's no reason to expect a simpler explanation.
You could believe:
Some things are true simply because they are true, but only when being true isn't very surprising. (For instance, it isn't very surprising that there are some cellular automata that live for 100 steps or that any particular cellular automaton lives for 100 steps.)
However, things which are very surprising and don't have a relatively compact explanation are exponentially rare. And, in the case where something is infinitely surprising (e.g., if the digits of pi weren't normal), there will exist a finite explanation.
(I don't expect o3-mini is a much better agent than 3.5 sonnet new out of the box, but probably a hybrid scaffold with o3 + 3.5 sonnet will be substantially better than 3.5 sonnet. Just o3 might also be very good. Putting aside cost, I think o1 is usually better than o3-mini on open-ended programming agency tasks.)
The question of context might be important, see here. I wouldn't find 15 minutes that surprising for ~50% success rate, but I've seen numbers more like 1.5 hours. I thought this was likely to be an overestimate so I went down to 1 hour, but more like 15-30 minutes is also plausible.
Keep in mind that I'm talking about agent scaffolds here.
I mean, I don't think AI R&D is a particularly hard field per se, but I do think it involves lots of tricky stuff and isn't much easier than automating some other plausibly-important-to-takeover field (e.g., robotics). (I could imagine that the AIs have a harder time automating philosophy even if they were trying to work on this, but it's more confusing to reason about because human work on this is so dysfunctional.) The main reason I focused on AI R&D is that I think it is much more likely to be fully automated first and seems like it is probably fully automated prior to AI takeover.
How much FLOP do you think it takes for large scale RL to learn codes? Let's say RL learns 100x less than SL (seems plausible) and is only 10% as focused on learning new ways of thinking / languages as SL. Then, we'd expect that reasonably efficient RL with 3 OOMs more FLOP than big pretraining runs (which do learn this to some extent) could learn new languages. This would naively be a ~1e27 FLOP RL run (assuming that we can learn this much stuff in 1e24 FLOP pretraining runs).
I think we'll probably see 1e27 FLOP RL runs next year?
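Spelling out the arithmetic behind the ~1e27 figure (a sketch of the estimate exactly as stated above, not a claim about any actual training run):

```python
# From the assumptions above: RL learns ~100x less per FLOP than SL, and only
# ~10% of that learning goes toward new ways of thinking / new languages.
rl_efficiency_penalty = 100
rl_focus_penalty = 10
pretraining_flop = 1e24          # FLOP scale at which pretraining learns this

extra_flop_factor = rl_efficiency_penalty * rl_focus_penalty  # 1000x = 3 OOMs
rl_flop_needed = pretraining_flop * extra_flop_factor
print(f"{rl_flop_needed:.0e}")   # ~1e+27 FLOP RL run
```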