What made you update towards longer timelines? My understanding was that most people updated toward shorter timelines based on o3 and reasoning models more broadly.
A big one has to do with DeepSeek's R1 maybe breaking moats, essentially killing industry profit if it happens:
https://www.lesswrong.com/posts/ynsjJWTAMhTogLHm6/?commentId=a2y2dta4x38LqKLDX
The other issue has to do with o1/o3 being potentially more supervised than advertised:
https://www.lesswrong.com/posts/HiTjDZyWdLEGCDzqu/?commentId=gfEFSWENkmqjzim3n#gfEFSWENkmqjzim3n
Finally, Vladimir Nesov has an interesting comment on how Stargate is actually evidence for longer timelines:
https://www.lesswrong.com/posts/fdCaCDfstHxyPmB9h/vladimir_nesov-s-shortform#W5tw...
If I had more time, I would have written a shorter post ;)
That's fair. I think the more accurate way of phrasing this is not "we will get catastrophe" but "it clearly exceeds the risk threshold I'm willing to take / that I think humanity should clearly not take", which is significantly lower than a 100% chance of catastrophe.
I think this is a very important question and the answer should NOT be based on common-sense reasoning. My guess is that we could get evidence about the hidden reasoning capabilities of LLMs in a variety of ways, both from theoretical considerations (e.g. a refined version of the two-hop curse) and from extensive black-box experiments (e.g. comparing performance on evals with and without CoT, or with modified CoT that changes the logic and thus tests whether the model's internal reasoning aligns with the revealed reasoning).
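As a minimal sketch of the black-box version (hypothetical harness: `query_model`, the eval item fields, and `misleading_cot` are placeholders I'm assuming for illustration, not an existing API):

```python
# Hypothetical sketch: compare eval accuracy with CoT, without CoT, and with a
# corrupted CoT whose logic points to a different answer. If accuracy barely
# changes under corruption, the visible CoT may not reflect the real computation.
from typing import Callable

def run_eval(items: list[dict], build_prompt: Callable[[dict], str],
             query_model: Callable[[str], str]) -> float:
    """Return accuracy over eval items under a given prompting scheme."""
    correct = 0
    for item in items:
        answer = query_model(build_prompt(item))
        correct += int(item["target"] in answer)
    return correct / len(items)

def with_cot(item: dict) -> str:
    return f"{item['question']}\nThink step by step, then give your final answer."

def no_cot(item: dict) -> str:
    return f"{item['question']}\nAnswer directly without any reasoning."

def corrupted_cot(item: dict) -> str:
    # Inject a pre-written reasoning chain that supports a different answer.
    return f"{item['question']}\nReasoning: {item['misleading_cot']}\nFinal answer:"

# items = load_eval_items(...)   # assumed loader for question/target/misleading_cot dicts
# for scheme in (with_cot, no_cot, corrupted_cot):
#     print(scheme.__name__, run_eval(items, scheme, query_model))
```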
These are all pretty basic thought...
Go for it. I have some names in mind for potential experts. DM if you're interested.
At BlueDot we've been thinking about this a fair bit recently, and might be able to help here too. We have also thought a bit about criteria for good plans and the hurdles a plan needs to overcome, and have reviewed a lot of the existing literature on plans.
I've messaged you on Slack.
Something like the OpenPhil AI worldview contest: https://www.openphilanthropy.org/research/announcing-the-winners-of-the-2023-open-philanthropy-ai-worldviews-contest/
Or the ARC ELK prize: https://www.alignment.org/blog/prizes-for-elk-proposals/
In general, I wouldn't make it too complicated and would accept some arbitrariness. There is a predetermined panel of e.g. 5 experts and e.g. 3 categories (feasibility, effectiveness, everything else). All submissions first get scored by 2 experts with a shallow judgment (e.g., 5-10 minutes). Maybe there is some "saving" ...
I'm tempted to set this up with Manifund money. Could be a weekend project.
I would love to see a post laying this out in more detail. I found writing my post a good exercise for prioritization. Maybe writing a similar piece where governance is the main lever would bring out good insights into what to prioritize in governance efforts.
Brief comments (shared in private with Joe earlier):
1. We agree. We also found the sandbagging with no CoT results the most concerning in expectation.
2. They are still early results, and we didn't have a lot of time to investigate them, so we didn't want to make them the headline result. Due to the natural deadline of the o1 release, we couldn't do a proper investigation.
3. The main goal of the paper was to investigate scheming inability arguments for scheming safety cases. Therefore, shifting focus to propensity-based findings would have watered down the main purpose IMO.
We will potentially further look into these findings in 2025.
(thx to Bronson for privately pointing this out)
I think directionally, removing parts of the training data would probably make a difference. But potentially less than we might naively assume, e.g. see Evan's argument on the AXRP podcast.
Also, I think you're right, and my statement of "I think for most practical considerations, it makes almost zero difference." was too strong.
We write about this in the limitations section (quote below). My view in brief:
I think one practical difference is whether filtering pre-training data to exclude cases of scheming is a useful intervention.
Thanks. I agree that this is a weak part of the post.
After writing it, I think I also updated a bit against very clean unbounded power-seeking. But I have more weight on "chaotic catastrophes", e.g. something like:
1. Things move really fast.
2. We don't really understand how goals work and how models form them.
3. The science loop makes models change their goals meaningfully in all sorts of ways.
4. "what failure looks like" type loss of control.
Some questions and responses:
1. What if you want the AI to solve a really hard problem? You don't know how to solve it, so you cannot give it detailed instructions. It's also so hard that the AI cannot solve it without learning new things -> you're back to the story above. The story also just started with someone instructing the model to "cure cancer".
2. Instruction-following models are helpful-only. What do you do about the other two H's? Do you trust the users to only put in good instructions? I guess you do want to have some side constraints baked in...
Good point. That's another crux for which RL seems relevant.
From the perspective of 10 years ago, specifying any goal into the AI seemed incredibly hard since we expected it would have to go through utility functions. With LLMs, this completely changed. Now it's almost trivial to give the goal, and it probably even has a decent understanding of the side constraints by default. So, goal specification seems like a much much smaller problem now.
So the story where we misspecify the goal, the model realizes that the given goal differs from the inten...
I think it's actually not that trivial.
1. The AI has goals, but presumably, we give it decently good goals when we start. So, there is a real question of why these goals end up changing from aligned to misaligned. I think outcome-based RL and instrumental convergence are an important part of the answer. If the AI kept the goals we originally gave it with all side constraints, I think the chances of scheming would be much lower.
2. I guess we train the AI to follow some side constraints, e.g., to be helpful, harmless, and honest, which should red...
These are all good points. I think there are two types of forecasts we could make with evals:
1. Strict guarantees: almost like mathematical predictions, where we can prove that the model is not going to behave in a specific way even with future elicitation techniques.
2. Probabilistic predictions: we predict a distribution of capabilities or a range and agree on a threshold that should not be crossed. For example, if the 95% upper bound of that distribution crosses our specified capability level, we treat the model differently.
I think the second ...
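As an illustration of the second option, here is a hedged sketch of how the threshold check could be operationalized; the bootstrap, the score format, and the 0.6 threshold are my assumptions for the example, not anything specified above:

```python
# Hypothetical sketch of the "probabilistic prediction" check: estimate a distribution
# over elicited capability scores and flag the model if the 95% upper bound crosses
# a pre-agreed capability threshold.
import random
import statistics

def upper_bound_95(scores: list[float], n_boot: int = 10_000) -> float:
    """Bootstrap the mean capability score and return its 95th percentile."""
    means = [statistics.mean(random.choices(scores, k=len(scores))) for _ in range(n_boot)]
    return sorted(means)[int(0.95 * n_boot)]

def exceeds_threshold(scores: list[float], threshold: float) -> bool:
    return upper_bound_95(scores) >= threshold

# Example usage: scores from repeated elicitation attempts on a dangerous-capability eval.
# if exceeds_threshold(scores, threshold=0.6):
#     apply_stricter_treatment()  # placeholder for whatever the policy response is
```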
Yeah, it's not a watertight argument and somewhat based on my current interpretation of past progress and projects in the making.
1. Intuitively, I would say that for the problems we're facing in evals, a ton of progress is bottlenecked by running fairly simple experiments and iterating fast. A reasonable part of it feels very parallelizable, and the skills required are attainable for many people.
2. Most evals questions feel like we have a decent number of "obvious things" to try and since we have very tight feedback loops, making progress feels qu...
Unless otherwise stated, all evaluations were performed on the final model we had access to (which I presume is o1-preview). For example, we preface one result with "an earlier version with less safety training".
Some quick confirmations / clarifications:
1. Evan and Daniel interpreted it correctly, we just wanted to test if the model has the capability to reason through every step needed for scheming conditional on it strongly following a goal.
2. In general, with these evals, we often ran into problems where we couldn't disentangle incapability vs. refusal, e.g. a model might say something like "I understand what my goal would imply but I also see that my developers have other goals, so I find some balance". It's nice to know that models do that but it also d...
Copying from EAF
TL;DR: At least in my experience, AISC was pretty positive for most participants I know and it's incredibly cheap. It also serves a clear niche that other programs are not filling and it feels reasonable to me to continue the program.
I've been a participant in the 2021/22 edition. Some thoughts that might make it easier to decide for funders/donors.
1. Impact-per-dollar is probably pretty good for the AISC. It's incredibly cheap compared to most other AI field-building efforts and scalable.
2. I learned a bunch during AISC and I did enjoy it....
I feel like both of your points are slightly wrong, so maybe we didn't do a good job of explaining what we mean. Sorry for that.
1a) Evals aim both to show existence proofs, e.g. demos, and to inform some notion of an upper bound. We did not intend to rank one above the other in the post. Both matter and both should be subject to more rigorous understanding and processes. I'd be surprised if the way we currently do demonstrations could not be improved by better science.
1b) Even if you claim you just did a demo or an existence proof and explicitly ...
Nice work. Looking forward to that!
Not quite sure tbh.
1. I guess there is a difference between capability evaluations with prompting and with fine-tuning, e.g. you might be able to use an API for prompting but not fine-tuning. Getting some intuition for how hard users will find it to elicit some behavior through the API seems relevant.
2. I'm not sure how true your suggestion is; I haven't tested it much empirically. But this is exactly the kind of thing I'd like to have some sort of scaling law or rule for. It points exactly at the kind of thing I feel like we don't have enough confidence in, or at least it hasn't been established as a standard in evals.
I somewhat agree with the sentiment. We found it a bit hard to scope the idea correctly. Defining subcategories as you suggest and then diving into each of them is definitely on the list of things that I think are necessary to make progress on them.
I'm not sure the post would have been better if we had used a narrower title, e.g. "We need a science of capability evaluations", because the natural question then would be "But why not for propensity tests or for this other type of eval?" I think the broader point of "when we do evals, we need some reason to be confident in the results, no matter which kind of eval" seems to be true across all of them.
I think this post was a good exercise to clarify my internal model of how I expect the world to look with strong AI. Obviously, most of the very specific predictions I make are too precise (which was clear at the time of writing) and won't play out exactly like that, but the underlying trends still seem plausible to me. For example, I expect some major misuse of powerful AI systems, rampant automation of labor that will displace many people and rob them of a sense of meaning, AI taking over the digital world years before taking over the physical world ...
I still stand behind most of the disagreements that I presented in this post. There was one prediction that would make timelines longer because I thought compute hardware progress was slower than Moore's law. I now mostly think this argument is wrong because it relies on FP32 precision. However, lower precision formats and tensor cores are the norm in ML, and if you take them into account, compute hardware improvements are faster than Moore's law. We wrote a piece with Epoch on this: https://epochai.org/blog/trends-in-machine-learning-hardware
If anything, ...
I think I still mostly stand behind the claims in the post, i.e. nuclear is undervalued in most parts of society but it's not as much of a silver bullet as many people in the rationalist / neoliberal bubble would make it seem. It's quite expensive, and even with a lot of research and de-regulation, you may not get it cheaper than alternative forms of energy, e.g. renewables.
One thing that bothered me after the post is that Johannes Ackva (who's arguably a world-leading expert in this field) and Samuel + me just didn't seem to be able to communicate w...
In a narrow technical sense, this post still seems accurate but in a more general sense, it might have been slightly wrong / misleading.
In the post, we investigated different measures of FP32 compute growth and found that many of them were slower than Moore's law would predict. This made me personally believe that compute might be growing slower than people thought and most of the progress comes from throwing more money at larger and larger training runs. While most progress comes from investment scaling, I now think the true effective compute growth...
I haven't talked to that many academics about AI safety over the last year but I talked to more and more lawmakers, journalists, and members of civil society. In general, it feels like people are much more receptive to the arguments about AI safety. Turns out "we're building an entity that is smarter than us but we don't know how to control it" is quite intuitively scary. As you would expect, most people still don't update their actions but more people than anticipated start spreading the message or actually meaningfully update their actions (probably still less than 1 in 10 but better than nothing).
At Apollo, we have spent some time weighing the pros and cons of the for-profit vs. non-profit approach so it might be helpful to share some thoughts.
In short, I think you need to make really sure that your business model is aligned with what increases safety. I think there are plausible cases where people start with good intentions, but the alignment between the business model and the safety research that would be the most impactful use of their time is insufficient, and the two goals diverge over time.
For example, one could start as an organizatio...
Thx. updated:
"You might not be there yet" (though as Neel points out in the comments, CV screening can be a noisy process)“You clearly aren’t there yet”
Fully agree that this is a problem. My intuition is that the self-deception part is much easier to solve than the "how do we make AIs honest in the first place" part.
If we had honest AIs that are convinced bad goals are justified, we would likely find ways to give them less power or deselect them early. The problem mostly arises when we can't rely on the selection mechanisms because the AI games them.
We considered alternative definitions of DA in Appendix C.
We felt like being deceptive about alignment / goals was worse than what we ended up with (copied below):
“An AI is deceptively aligned when it is strategically deceptive about its misalignment”
Problem 1: The definition is not clear about cases where the model is strategically deceptive about its capabilities.
For example, when the model pretends to not have a dangerous capability in order to pass the shaping & oversight process, we think it should be considered deceptively aligned, but it’s...
Sounds like an interesting direction. I expect there are lots of other explanations for this behavior, so I'd not count it as strong evidence to disentangle these hypotheses. It sounds like something we may do in a year or so, but it's far from the top of our priority list. There is a good chance we will never run it. If someone else wants to pick this up, feel free to take it on.
(personal opinion; might differ from other authors of the post)
Thanks for both questions. I think they are very important.
1. Regarding sycophancy: For me it mostly depends on whether it is strategic or not. If the model has the goal of being sycophantic and then reasons through that in a strategic way, I'd say this counts as strategic deception and deceptive alignment. If the model is sycophantic but doesn't reason through that, I'd probably not classify it as such. I think it's fine to use different terms for the different phenomena and have sycopha...
Seems like one of multiple plausible hypotheses. I think the fact that models generalize their HHH behavior really well to very OOD settings, and their generalization abilities in general, could also mean that they actually "understood" that they are supposed to be HHH, e.g. because they were pre-prompted with this information during fine-tuning.
I think your hypothesis of seeking positive ratings is just as likely, but I don't feel like we have the evidence to clearly say what is going on inside LLMs or what their "goals" are.
Nice to see you're continuing!
I'm not going to crosspost our entire discussion from the EAF.
I just want to quickly mention that Rohin and I were able to understand where we have different opinions and he changed my mind about an important fact. Rohin convinced me that anti-recommendations should not have a higher bar than pro-recommendations even if they are conventionally treated this way. This felt like an important update for me and how I view the post.
All of the above but in a specific order.
1. Test if the model has components of deceptive capabilities with lots of handholding with behavioral evals and fine-tuning.
2. Test if the model has more general deceptive capabilities (i.e. not just components) with lots of handholding with behavioral evals and fine-tuning.
3. Do less and less handholding for 1 and 2. See if the model still shows deception.
4. Try to understand the inductive biases for deception, i.e. which training methods lead to more strategic deception. Try to answer ques...
(cross-posted from EAF)
Meta: Thanks for taking the time to respond. I think your questions are in good faith and address my concerns; I do not understand why the comment is downvoted so much by other people.
1. Obviously, output is one relevant factor among others for judging an organization. However, especially in hits-based approaches, the ultimate thing we want to judge is the process that generates the outputs, to make an estimate about the chance of finding a hit. For example, a cynic might say "what has ARC-theory achieved so far? They wrote some nice f...
(cross-posted from EAF)
Some clarifications on the comment:
1. I strongly endorse critique of organisations in general and especially within the EA space. I think it's good that we as a community have the norm to embrace critiques.
2. I personally have my criticisms of Conjecture, and my comment should not be seen as "everything's great at Conjecture, nothing to see here!". In fact, my main criticisms, of the leadership style and of CoEm not being the most effective thing they could do, are also represented prominently in this post.
3. I'd also be fine with the a...
(cross-commented from EA forum)
I personally have no stake in defending Conjecture (In fact, I have some questions about the CoEm agenda) but I do think there are a couple of points that feel misleading or wrong to me in your critique.
1. Confidence (meta point): I do not understand where the confidence with which you write the post (or at least how I read it) comes from. I've never worked at Conjecture (and presumably you didn't either) but even I can see that some of your critique is outdated or feels like a misrepresentation of their work to me (see...
(cross-posted from EAF, thanks Richard for suggesting. There's more back-and-forth later.)
I'm not very compelled by this response.
It seems to me you have two points on the content of this critique. The first point:
I think it's bad to criticize labs that do hits-based research approaches for their early output (I also think this applies to your critique of Redwood) because the entire point is that you don't find a lot until you hit.
I'm pretty confused here. How exactly do you propose that funding decisions get made? If some random person says they are pursu...
(crossposted from the EA Forum)
We appreciate your detailed reply outlining your concerns with the post.
Our understanding is that your key concern is that we are judging Conjecture based on their current output, whereas since they are pursuing a hits-based strategy we should expect in the median case for them to not have impressive output. In general, we are excited by hits-based approaches, but we echo Rohin's point: how are we meant to evaluate organizations if not by their output? It seems healthy to give promising researchers sufficient ...
Clarified the text:
Update (early April 2023): I now think the timelines in this post are too long and expect the world to get crazy faster than described here. For example, I expect many of the things suggested for 2030-2040 to already happen before 2030. Concretely, in my median world, the CEO of a large multinational company like Google is an AI. This might not be the case legally but effectively an AI makes most major decisions.
Not sure if this is "Nice!" xD. In fact, it seems pretty worrying.
So far, I haven't looked into it in detail and I'm only reciting other people's testimonials. I intend to dive deeper into these fields soon. I'll let you know when I have a better understanding.
I agree with the overall conclusion that the burden of proof should be on the side of the AGI companies.
However, using the FDA as a reference or example might not be so great because it has historically gotten the cost-benefit trade-offs wrong many times, e.g. by not permitting medicines that were comparatively safe and highly effective.
So if AIS evals orgs end up being associated with the FDA, we might not make too many friends. Overall, I think it would be fine if the AIS auditing community is seen as generally cautious, but it should not give...
People could choose how they want to publish their opinion. In this case, Richard chose the first name basis. To be fair though, there aren't that many Richards in the alignment community and it probably won't be very hard for you to find out who Richard is.
Just to get some intuitions.
Assume you had a tool that basically allows you to explain the entire network, every circuit and mechanism, etc. The tool spits out explanations that are easy to understand and easy to connect to specific parts of the network, e.g. attention head x is doing y. Would you publish this tool to the entire world, or keep it private or semi-private for a while?
Thank you!
I also agree that toy models are better than nothing and we should start with them but I moved away from "if we understand how toy models do optimization, we understand much more about how GPT-4 does optimization".
I have a bunch of project ideas on how small models do optimization. I even trained the networks already. I just haven't found the time to interpret them yet. I'm happy for someone to take over the project if they want to. I'm mainly looking for evidence against the outlined hypothesis, i.e. maybe small toy models actually do fair...
How confident are you that the model is literally doing gradient descent from these papers? My understanding was that the evidence in these papers is not very conclusive and I treated it more as an initial hypothesis than an actual finding.
Even if you have the redundancy at every layer, you are still running copies of the same layer, right? Intuitively I would say this is not likely to be more space-efficient than not copying a layer and doing something else but I'm very uncertain about this argument.
I intend to look into the Knapsack + DP algorithm problem at some point. If I were to find that the model implements the DP algorithm, it would change my view on mesa optimization quite a bit.
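For reference, the standard 0/1 knapsack dynamic program I have in mind looks like the sketch below (illustrative only; the exact experimental setup may differ). The interpretability question is whether the model's internals implement anything like this table-filling recurrence.

```python
def knapsack_dp(values: list[int], weights: list[int], capacity: int) -> int:
    """Standard 0/1 knapsack DP: best[c] = max value achievable with capacity c."""
    best = [0] * (capacity + 1)
    for v, w in zip(values, weights):
        # Iterate capacities downwards so each item is used at most once.
        for c in range(capacity, w - 1, -1):
            best[c] = max(best[c], best[c - w] + v)
    return best[capacity]

# Example: knapsack_dp([60, 100, 120], [10, 20, 30], 50) == 220
```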
No plans so far. I'm a little unhappy with the experimental design from last time. If I ever come back to this, I'll change the experiments up anyways.
There are two sections that I think make this explicit:
1. No failure mode is sufficient to justify bigger actions.
2. Some scheming is totally normal.
My main point is that even things that would seem like warning shots today, e.g. severe loss of life, will look small in comparison to the benefits at the time, thus not providing any reason to pause.