IIRC, an Anthropic staff member told me that he had a strong suspicion for why this is, but that it was tied up in proprietary info so he didn't want to say.
I feel like GDM safety and Constellation are similar enough to be in the same cluster: I bet within-cluster variance is bigger than between-cluster variance.
FWIW, I think that the GDM safety people are at least as similar to the Constellation/Redwood/METR cluster as the Anthropic safety people are, probably more similar. (And Anthropic as a whole has very different beliefs than the Constellation cluster, e.g. not having much credence on misalignment risk.)
Hammond was, right?
Ryan agrees; the main thing he means by "behavioral output" is what you're saying: an actually really dangerous action.
I think we should probably say that exploration hacking is a strategy for sandbagging, rather than using them as synonyms.
Isn’t the answer that the low-hanging fruit of explaining unexplained observations has been picked?
I appeared on the 80,000 Hours podcast. I discussed a bunch of points on misalignment risk and AI control that I don't think I've heard discussed publicly before.
Transcript + links + summary here; it's also available as a podcast in many places.
What do you think are the most important points that weren't publicly discussed before?
I love that I can guess the infohazard from the comment
A few months ago, I accidentally used France as an example of a small country that it wouldn't be that catastrophic for AIs to take over, while giving a talk in France 😬
No problem, my comment was pretty unclear and I can see from the other comments why you'd be on edge!
It seems extremely difficult to make a blacklist of models in a way that isn't trivially breakable. (E.g. what's supposed to happen when someone adds a tiny amount of noise to the weights of a blacklisted model, or rotates them along a gauge invariance?)
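To make the noise point concrete, here's a toy sketch of how an exact-hash blacklist breaks (the hashing scheme and noise scale are purely illustrative, not a claim about how any real registry works):

```python
# Toy illustration: an exact-hash blacklist is defeated by an imperceptible weight perturbation.
import hashlib

import torch
import torch.nn as nn


def weights_hash(model: nn.Module) -> str:
    """Hash the raw bytes of all parameters (a naive blacklist key)."""
    h = hashlib.sha256()
    for p in model.parameters():
        h.update(p.detach().cpu().numpy().tobytes())
    return h.hexdigest()


model = nn.Linear(16, 16)          # stand-in for a blacklisted model
blacklist = {weights_hash(model)}  # the registry stores this hash

# Add noise far too small to meaningfully change the model's behavior.
with torch.no_grad():
    for p in model.parameters():
        p.add_(torch.randn_like(p) * 1e-7)

print(weights_hash(model) in blacklist)  # False: the hash no longer matches
```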
I agree that this isn't what I'd call "direct written evidence"; I was just (somewhat jokingly) making the point that the linked articles are Bayesian evidence that Musk tries to censor, and that the articles are pieces of text.
It is definitely evidence that was literally written
I disagree that you have to believe those four things in order to believe what I said. I believe some of those and find others too ambiguously phrased to evaluate.
Re your model: I think your model is basically just: if we race, we go from a 70% chance that the US "wins" to a 75% chance that the US wins, and we go from a 50% chance of "solving alignment" to a 25% chance? Idk how to apply that here: isn't your squiggle model talking about whether racing is good, rather than whether unilaterally pausing is good? Maybe you're using "race" to mean "not pause" and "not rac...
I have this experience with @ryan_greenblatt -- he's got an incredible ability to keep really large and complicated argument trees in his head, so he feels much less need to come up with slightly-lossy abstractions and categorizations than e.g. I do. This is part of why his work often feels like huge, mostly unstructured lists. (The lists are more unstructured before his pre-release commenters beg him to structure them more.) (His code often also looks confusing to me, for similar reasons.)
Some quick takes:
A few points:
A few takes:
I believe that there is also an argument to be made that the AI safety community is currently very under-indexed on research into future scenarios where assumptions about the AI operator taking baseline safety precautions related to preventing loss of control do not hold.
I think you're mixing up two things: the extent to which we consider the possibility that AI operators will be very incautious, and the extent to which our technical research focuses on that possibility.
My research mostly focuses on techniques that an AI developer could us...
I am not sure I agree with this change at this point. How do you feel now?
We're planning to release some talks; I also hope we can publish various other content from this!
I'm sad that we didn't have space for everyone!
I wrote thoughts here: https://redwoodresearch.substack.com/p/fields-that-i-reference-when-thinking?selection=fada128e-c663-45da-b21d-5473613c1f5c
Alignment Forum readers might be interested in this:
...
Announcing ControlConf: The world’s first conference dedicated to AI control - techniques to mitigate security risks from AI systems even if they’re trying to subvert those controls. March 27-28, 2025 in London. 🧵

ControlConf will bring together:
- Researchers from frontier labs & government
- AI researchers curious about control mechanisms
- InfoSec professionals
- Policy researchers
Our goals: build bridges across disciplines, share knowledge, and coordinate research priorities.
The conference will feature
Paul is not Ajeya, and also Eliezer only gets one bit from this win, which I think is insufficient grounds for behaving like such an asshole.
Thanks. (Also note that the model isn't the same as her overall beliefs at the time, though they were similar at the 15th and 50th percentiles.)
the OpenPhil doctrine of "AGI in 2050"
(Obviously I'm biased here by being friends with Ajeya.) This is only tangentially related to the main point of the post, but I think you're really overstating how many Bayes points you get against Ajeya's timelines report. Ajeya gave 15% to AGI before 2036, with little of that in the first few years after her report; maybe she'd have said 10% between 2025 and 2036.
I don't think you've ever made concrete predictions publicly (which makes me think it's worse behavior for you to criticize people for their predictions), b...
Yudkowsky seems confused about OpenPhil's exact past position. Relevant links:
Here "doctrine" is an applause light; boo, doctrines. I wrote a report, you posted your timeline, they have a doctrine.
All involved, including Yudkowsky, understand that 2050 was a median estimate, not a point estimate. Yudkowsky wrote that it has "very wide credible intervals around both si...
Further detail on this: Cotra has more recently updated at least 5x against her original 2020 model in the direction of faster timelines.
Greenblatt writes:
Here are my predictions for this outcome:
- 25th percentile: 2 years (Jan 2027)
- 50th percentile: 5 years (Jan 2030)
Cotra replies:
My timelines are now roughly similar on the object level (maybe a year slower for 25th and 1-2 years slower for 50th)
This means 25th percentile for 2028 and 50th percentile for 2031-2.
The original 2020 model assigns 5.23% by 2028, 9.13% by 2031, and 10.64% by 2032. Each t...
Ajeya gave 15% to AGI before 2036, with little of that in the first few years after her report; maybe she'd have said 10% between 2025 and 2036.
Just because I was curious, here is the most relevant chart from the report:
This is not a direct probability estimate (since it's about probability of affordability), but it's probably within a factor of 2. Looks like the estimate by 2030 was 7.72% and the estimate by 2036 is 17.36%.
@ryan_greenblatt is working on a list of alignment research applications. For control applications, you might enjoy the long list of control techniques in our original post.
- Alignable systems design: Produce a design for an overall AI system that accomplishes something interesting, apply multiple safety techniques to it, and show that the resulting system is both capable and safe. (A lot of the value here is in figuring out how to combine various safety techniques together.)
I don't know what this means, do you have any examples?
I think we should just all give up on the word "scalable oversight"; it is used in many conflicting ways, sadly. I mostly talk about "recursive techniques for reward generation".
Some reasons why the “ten people on the inside” might have massive trouble doing even cheap things:
I don't think you should think of "poor info flows" as something that a company actively does, but rather as the default state of affairs for any fast-moving organization with 1000+ people. Such companies normally need to actively fight against poor info flows, resulting in not-maximally-terrible-but-still-bad info flows.
This is a case where I might be over-indexing on my experience at Google, but I'd currently bet that if you surveyed a representative set of Anthropic and OpenAI employees, more of them would mostly agree with that statement than mostly dis...
How well did this workshop/exercise set go?
Yep, I think that at least some of the 10 would have to have some serious hustle and political savvy that is atypical (but not totally absent) among AI safety people.
What laws are you imagining making it harder to deploy stuff? Notably, I'm imagining these people mostly doing stuff with internal deployments.
I think you're overfixating on the experience of Google, which has more complicated production systems than most.
I've talked to a lot of people who have left leading AI companies for reasons related to thinking that their company was being insufficiently cautious. I wouldn't usually say that they'd left "in protest"; for example, most of them haven't directly criticized the companies after leaving.
In my experience, the main reason that most of these people left was that they found it very unpleasant to work there and thought their research would go better elsewhere, not that they wanted to protest poor safety policies per se. I usually advise such people against l...
Many more than two safety-concerned people have left AI companies for reasons related to thinking that those companies are reckless.
Some tweets I wrote that are relevant to this post:
...In general, AI safety researchers focus way too much on scenarios where there's enough political will to adopt safety techniques that are seriously costly and inconvenient. There are a couple of reasons for this.
Firstly, AI company staff are disincentivized from making their companies look reckless, and if they give accurate descriptions of the amount of delay that the companies will tolerate, it will sound like they're saying the company is reckless.
Secondly, safety-concerned people outside AI companies feel w
I appreciate the spirit of this type of calculation, but think that it's a bit too wacky to be that informative. I think that it's a bit of a stretch to string these numbers together. E.g. I think Ryan and Tom's predictions are inconsistent, and I think that it's weird to identify 100%-AI as the point where we need to have "solved the alignment problem", and I think that it's weird to use the Apollo/Manhattan program as an estimate of work required. (I also don't know what your Manhattan project numbers mean: I thought there were more like 2.5k scientists/engineers at Los Alamos, and most of the people elsewhere were purifying nuclear material)
I also loved this doc.
IIRC, for most attention heads the max attention is way less than 90%, so my answer is "no". It should be very easy to get someone to make a basic graph of this for you.
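For concreteness, here's a minimal sketch of how someone could make that graph, assuming a Hugging Face model like GPT-2 and taking "max attention" to mean the largest attention weight each head assigns on a single forward pass (swap in whatever model and prompt you actually care about):

```python
# Rough sketch: plot the maximum attention weight per head for one prompt.
import torch
import matplotlib.pyplot as plt
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2", output_attentions=True)
model.eval()

inputs = tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: one (batch, heads, query, key) tensor per layer.
max_per_head = torch.stack(
    [layer_attn[0].amax(dim=(-1, -2)) for layer_attn in outputs.attentions]
)  # shape: (num_layers, num_heads)

plt.imshow(max_per_head.numpy(), aspect="auto", vmin=0, vmax=1)
plt.colorbar(label="max attention weight")
plt.xlabel("head")
plt.ylabel("layer")
plt.title("Max attention weight per head (single prompt)")
plt.show()
```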
Yeah I strongly disagree that this failure mode is very likely to be very hard to mostly resolve. I'm doing some research right now that will hopefully shed some light on this!
My sense is that the control techniques we've talked about (e.g. these, and some other stuff that we've thought of since then) would indeed drastically reduce the risk posed by those Von Neumanns, especially if you run control evaluations with the Von Neumanns.
For example, consider the problem "I want to ask the AI a question that has a numerical answer and that would take a human 8 hours to answer, and I'm worried that occasionally on a subdomain of its choice it will blatantly lie to me". If you do no countermeasures, it seems easy for the AI to cause th...
In your post, you emphasize the slop problem. I think that the slop problem is probably much harder to solve if those AIs are scheming. I guess you're saying that it's just unlikely that the AIs are scheming at the point where you're worried about the slop problem?
Yeah, John's position seems to require "it doesn't matter whether huge numbers of Von Neumann level AGIs are scheming against you", which seems crazy to me.
People will sure be scared of AI, but the arms race pressure will be very strong, and I think that is a bigger consideration.
I think criticisms from people without much of a reputation are often pretty well-received on LW, e.g. this one.
I am additionally leery of AI control, beyond my skepticism of its value in reducing doom, because creating a vast surveillance and enslavement apparatus to get work out of lots and lots of misaligned AGI instances seems like a potential moral horror.
I am very sympathetic to this concern, but I think that when you think about the actual control techniques I'm interested in, they don't actually seem morally problematic except inasmuch as you think it's bad to frustrate the AI's desire to take over.
Note also that it will probably be easier to act cautiously if you don't have to be constantly in negotiations with an escaped scheming AI that is currently working on becoming more powerful, perhaps attacking you with bioweapons, etc!
I think I would feel better about this if control advocates were clear that their strategy is two-pronged, and included somehow getting a pause on ASI development of some kind. Then they would at least be actively trying to make what I would consider one of the most key conditionals for control substantially reducing doom hold.
Idk, I think we're pretty clear that we aren't advocating "do control research and don't have any other plans or take any other actions". For example, in the Implications and proposed actions section of "The case for ensuring that po...
It's usually the case that online conversations aren't for persuading the person you're talking to, they're for affecting the beliefs of onlookers.
I agree with most of this, thanks for saying it. I've been dismayed for the last several years by continuing unreasonable levels of emphasis on interpretability techniques as a strategy for safety.
My main disagreement is that you place more emphasis than I would on chain-of-thought monitoring compared to other AI control methods. CoT monitoring seems like a great control method when available, but I think it's reasonably likely that it won't work on the AIs that we'd want to control, because those models will have access to some kind of "neuralese" that al...