250 upvotes is also crazy high. Another sign of the disastrous abilities of EA/LessWrong communities at character judgment.
The same thing is happening right now, before our eyes, with Anthropic. And similar crowds are just as confidently asserting that this time they're really the good guys.
I am somewhat confused about this.
To be clear I am pro people from organizations I think are corrupt showing up to defend themselves, so I would upvote it if it had like 20 karma or less.
I would point out that the comments criticizing the organization’s behavior and character are getting similar vote levels (e.g. the top comment calls OpenAI reckless and unwise and has 185 karma and 119 agree-votes).
I just skimmed but just wanted to flag that I like Bengio's proposal of one coordinated coalition that develops several AGIs in a coordinated fashion (e.g. training runs at the same time on their own clusters), which decreases the main downside of having one single AGI project (power concentration).
I still agree with a lot of that post and am still essentially operating on it.
I also think that it's interesting to read the comments because at the time the promise of those who thought my post was wrong was that Anthropic's RSP would get better and that this was only the beginning. With RSP V2 being worse and less specific than RSP V1, it's clear that this was overoptimistic.
Now, risk management in AI has also gone a lot more mainstream than it was a year ago, in large part thanks to the UK AISI, which started operating on it. People have also...
I'd be interested in also exploring model-spec-style aspirational documents too.
Happy to do a call on model-spec-style aspirational documents if it's at all relevant. I think this is important and we could be interested in helping develop a template for it if Anthropic were interested in using it.
Thanks for writing this post. I think the question of how to rule out risk post capability thresholds has generally been underdiscussed, despite it being probably the hardest risk management question with Transformers. In a recent paper, we coin the term "assurance properties" for the research directions that are helpful for this particular problem.
Using a similar type of thinking applied to other existing safety techniques, it seems to me like interpretability is one of the only current LLM safety directions that can get you a big Bayes factor.
The second o...
This article fails to account for the fact that abiding by the rules suggested would mostly kill the ability of journalists to share the most valuable information they share with the public.
You don't get to reveal stuff from the world's most powerful organizations if you double-check the quotes with them.
I think journalism is one of the professions where consequentialist vs deontological ethics have the toughest trade-offs. It's just really hard to abide by very high privacy standards and break highly important news.
As one illustrative example, your standard would have prevented Kelsey Piper from sharing her conversation with SBF. Is that a desirable outcome? Not sure.
Personally I use a mix of heuristics based on how important the new idea is, how rapid it is and how painful it will be to execute it in the future once the excitement dies down.
The more ADHD you are, the stronger the "burst of inspired-by-a-new-idea energy" effect is, so that should count too.
Do people have takes on the most useful metrics/KPIs that could give a sense of how good the monitoring/anti-misuse measures on APIs are?
Some ideas:
a) average time to close an account conducting misuse activities (my sense is that as long as this is >1 day, there's little chance of preventing state actors from using API-based models for a lot of misuse (everything which doesn't require major scale))
b) the logs of the 5 accounts/interactions that have been ranked as highest severity (my sense is that incident reporting like OpenAI/Microsoft have done on c...
This looks to be overwhelmingly the most likely in my opinion and I'm glad someone wrote this post. Thanks Buck
Thanks for answering, that's very useful.
My concern is that as far as I understand, a decent number of safety researchers are thinking that policy is the most important area, but because, as you mentioned, they aren't policy experts and don't really know what's going on, they just assume that Anthropic policy work is way better than those actually working in policy judge it to be. I've heard from a surprisingly high number of people among the orgs that are doing the best AI policy work that Anthropic policy is mostly anti-helpful.
Somehow though...
How aware were you (as an employee) & are you (now) of their policy work? In a world model where policy is the most important stuff, it seems to me like it could heavily tarnish Anthropic's net impact.
I don't quite understand the question. I've heard various bits of gossip, both as an employee and now. I wouldn't say I'm confident in my understanding of any of it. I was somewhat sad about Jack and Dario's public comments about thinking it's too early to regulate (if I understood them correctly), which I also found surprising as I thought they had fairly short timelines, but policy is not at all my area of expertise so I am not confident in this take.
I think it's totally plausible Anthropic has net negative impact, but the same is true for almost any sig...
This is the best alignment plan I've heard in a while.
You are a LessWrong reader, want to push humanity's wisdom and don't know how to do so? Here's a workflow:
See an application of the workflow here: https://www.lesswrong.com/posts/epgCXiv3Yy3qgcsys/you-can-t-predict-a-game-of-pinball?commentId=wjLFhiWWacByqyu6a
Playing catch-up is way easier than pushing the frontier of LLM research. One is about guessing which path others took, the other one is about carving a path among all the possible ideas that could work.
If China stopped having access to US LLM secrets and had to push the LLM frontier rather than playing catch-up, how much slower would it be at doing so?
My guess is at least >2x and probably more but I'd be curious to get takes.
Great initiative! Thanks for leading the charge on this.
Jack Clark: “Pre-deployment testing is a nice idea but very difficult to implement,” from https://www.politico.eu/article/rishi-sunak-ai-testing-tech-ai-safety-institute/
Thanks for the answer, it makes sense.
To be clear, I saw it thanks to Matt, who posted this tweet, so credit goes to him: https://x.com/SpacedOutMatt/status/1794360084174410104?t=uBR_TnwIGpjd-y7LqeLTMw&s=19
Lighthaven City for 6.6M€? Worth a look by the Lightcone team.
https://x.com/zillowgonewild/status/1793726646425460738?t=zoFVs5LOYdSRdOXkKLGh4w&s=19
Glad you're keeping your eye out for these things!
It's 8 hours away from the Bay, which all-in is not that different from a plane flight to NY from the Bay, so the location doesn't really help with being where all the smart and interesting people are.
Before we started the Lightcone Offices we did a bunch of interviews to see if all the folks in the bay-area x-risk scene would click a button to move to the Presidio District in SF (i.e. imagine Lightcone team packs all your stuff and moves it for you and also all these other people in the scene move too) and...
Thanks for sharing. It's both disturbing from a moral perspective and fascinating to read.
Very important point that wasn't on my radar. Thanks a lot for sharing.
So first the 85% net worth thing went quite viral several times and made Daniel Kokotajlo a bit of a heroic figure on Twitter.
Then Kelsey Piper's reporting pushed OpenAI to give back Daniel's vested units. I think it's likely that Kelsey used elements from this discussion as initial hints for her reporting and plausible that the discussion sparked her reporting, I'd love to have her confirmation or denial on that.
I'm not gonna lie, I'm pretty crazily happy that a random quick take I wrote in 10 minutes on a Friday morning about how Daniel Kokotajlo should get social reward and partial refunding sparked a discussion that seems to have caused positive effects wayyyy beyond expectations.
Quick takes are an awesome innovation: they let you post even when you're still partially confused/uncertain about something. Given the confusing details of the situation in that case, this would probably not have happened otherwise.
Mhhh, that seems very bad for someone in an AISI in general. I'd guess Jade Leung might sadly be under the same obligations...
That seems like a huge deal to me with disastrous consequences, thanks a lot for flagging.
Right. Thanks for putting the full context. Voluntary commitments refers to the WH commitments which are much narrower than the PF so I think my observation holds.
Agreed. Note that they don't say what Martin claims they say; they only say:
We’ve evaluated GPT-4o according to our Preparedness Framework
I think it's reasonably likely to imply that they broke all their non-evaluation PF commitments, while not being technically wrong.
...We’ve evaluated GPT-4o according to our Preparedness Framework and in line with our voluntary commitments. Our evaluations of cybersecurity, CBRN, persuasion, and model autonomy show that GPT-4o does not score above Medium risk in any of these categories. This assessment involved running a suite of automated and human evaluations throughout the model training process. We tested both pre-safety-mitigation and post-safety-mitigation versions of the model, using custom fine-tuning and prompts, to better elicit model capabilities.
GPT-4o has also und
Idea: Daniel Kokotajlo probably lost quite a bit of money by not signing an OpenAI NDA before leaving, which I consider a public service at this point. Could some of the funders of the AI safety landscape give some money or social reward for this?
I guess reimbursing everything Daniel lost might be a bit too much for funders but providing some money, both to reward the act and incentivize future safety people to not sign NDAs would have a very high value.
@Daniel Kokotajlo If you indeed avoided signing an NDA, would you be able to share how much you passed up as a result of that? I might indeed want to create a precedent here and maybe try to fundraise for some substantial fraction of it.
I mean the full option space obviously also includes "bargain with Russia and China to make credible commitments that they stop rearming (possibly in exchange for something)", and I think we should totally explore that path as well. I just don't have much hope in it at this stage, which is why I'm focusing on the other option, even if it is a fucked up local Nash equilibrium.
I've been thinking a lot recently about taxonomizing AI risk related concepts to reduce the dimensionality of AI threat modelling while remaining quite comprehensive. It's in the context of developing categories to assess whether labs' plans cover various areas of risk.
There are two questions I'd like to get takes on. Any take on one of these 2 would be very valuable.
Rephrasing based on an ask: "Western Democracies need to urgently put a hard stop to Russia and China war (preparation) efforts" -> Western Democracies need to urgently take actions to stop the current shift towards a new World order where conflicts are a lot more likely due to Western democracies no longer being a hegemonic power able to crush authoritarian powers that grab land etc. This shift is currently primarily driven by the fact that Russia & China are heavily rearming themselves whereas Western democracies are not.
@Elizabeth
I liked this extension (https://chrome.google.com/webstore/detail/whispering/oilbfihknpdbpfkcncojikmooipnlglo), which I use for long messages. I press a shortcut, it starts recording with Whisper, then I press it again and it puts the transcript in my clipboard.
In those, Ukraine committed to pass laws for Decentralisation of power, including through the adoption of the Ukrainian law "On temporary Order of Local Self-Governance in Particular Districts of Donetsk and Luhansk Oblasts". Instead of Decentralization they passed laws forbidding those districts from teaching children in the languages that those districts want to teach them in.
Ukraine's unwillingness to follow the agreements was a key reason why the invasion in 2022 happened and was very popular with the Russian population.
I wasn't aware of that, that's useful,...
Indeed. One consideration is that the LW community used to be much less into policy-adjacent stuff and hence much less relevant in that domain. Now, with AI governance becoming an increasingly big deal, I think we could potentially use some of that presence to push for certain things in defense.
Pushing for things in the genre of what Noah describes in the first piece I shared seems feasible for some people in policy.
Idk what the LW community can do but somehow, to the extent we think liberalism is valuable, the Western democracies need to urgently put a hard stop to Russia and China war (preparation) efforts. I fear that rearmament is a key component of the only viable path at this stage.
I won't argue in detail here but will link to Noahpinion, who's been quite vocal on those topics. The TLDR is that China and Russia have been scaling their war industry preparation efforts for years, while Western democracies' industries keep declining and remain crazily dependent on the...
Something which concerns me is that transformative AI will likely be a powerful destabilizing force, which will place countries currently behind in AI development (e.g. Russia and China) in a difficult position. Their governments are currently in the position of seeing that peacefully adhering to the status quo may lead to rapid disempowerment, and that the potential for coercive action to interfere with disempowerment is high. It is pretty clearly easier and cheaper to destroy chip fabs than create them, easier to kill tech employees with potent engineeri...
If you wanna reread the debate, you can scroll through this thread (https://x.com/bshlgrs/status/1764701597727416448).
There was a hot debate recently but regardless, the bottom line is just "RSPs should probably be interpreted literally and nothing else. If a literal statement is not strictly there, it should be assumed it's not a commitment."
I've not seen people doing very literal interpretation on those so I just wanted to emphasize that point.
I currently think Anthropic didn't "explicitly publicly commit" to not advance the rate of capabilities progress. But, I do think they made deceptive statements about it, and when I complain about Anthropic I am complaining about deception, not "failing to uphold literal commitments."
I'm not talking about the RSPs because the writing and conversations I'm talking about came before that. I agree that the RSP is more likely to be a good predictor of what they'll actually do.
I think most of the generator for this was more like "in person conversations", at le...
Given the recent argument on whether Anthropic really did commit to not push the frontier or just misled most people into thinking that it was the case, it's relevant to reread the RSPs in hairsplitting mode. I was rereading the RSPs and noticed a few relevant findings:
Disclaimer: this is focused on negative stuff but does not deny the merits of RSPs etc etc.
There are a number of properties of AI systems that make it easier to collect information about those systems in a safe way and hence demonstrate their safety: interpretability, formal verifiability, modularity etc. Which adjective would you use to characterize those properties?
I'm thinking of "resilience" because from the perspective of an AI developer it helps a lot in understanding the risk profile, but do you have other suggestions?
Some alternatives:
Unsure how much we disagree, Zach and Oliver, so I'll try to quantify: I would guess that Claude 3 will bring forward the release date of next-gen models from OpenAI by a few months at least (I would guess 3 months), which has significant effects on timelines.
Tentatively, I'm thinking that this effect may be superlinear. My model is that each new release increases the speed of development (because of increased investment across the whole value chain, including compute, plus people realizing that it's not like other technologies etc.), so that a few months now causes more than a few months on AGI timelines.
Oh thanks, I hadn't found it, gonna delete!
Yeah basically Davidad has not only a safety plan but a governance plan which actively aims at making this shift happen!
Thanks for writing that. I've been trying to taboo "goals" because it creates so much confusion, which this post tries to decrease. In line with this post, I think what matters is how difficult a task is to achieve, and what it takes to achieve it in terms of ability to overcome obstacles.
Because it's meaningless to talk about a "compromise" dismissing one entire side of the people who disagree with you (but only one side!).
Like I could say "global compute thresholds is a robustly good compromise with everyone who disagrees with me"
*Footnote: only those who're more pessimistic than me.
That may be right but then the claim is wrong. The true claim would be "RSPs seem like a robustly good compromise with people who are more optimistic than me".
And then the claim becomes not really relevant?
Holden, thanks for this public post.
Regarding
Responsible scaling policies (RSPs) seem like a robustly good compromise with people who have different views from mine
2. It seems like it's empirically wrong based on the strong pushback RSPs received so that at least you shouldn't call it "robustly", unless you mean a kind of modified version that would accommodate the most important parts of the pushback.
FWIW, my read here was that “people who have different views from mine” was in reference to these sets of people:
...
- Some people think that the kinds of risks I’m worried about are far off, farfetched
Would your concerns be mostly addressed if ARC had published a suggestion for a much more comprehensive risk management framework, and explicitly said "these are the principles that we want labs' risk-management proposals to conform to within a few years, but we encourage less-thorough risk management proposals before then, so that we can get some commitments on the table ASAP, and so that labs can iterate in public. And such less-thorough risk management proposals should prioritize covering x, y, z."
Great question! A few points:
Two questions related to it:
Thanks Eli for the comment.
One reason why I haven't provided much evidence is that I think it's substantially harder to give evidence of a "for all" claim (my side of the claim) than a "there exists" (what I ask Evan). I claim that it doesn't happen that a framework on a niche area evolves so fast without accidents based on what I've seen, even in domains with substantial updates, like aviation and nuclear.
I could potentially see it happening with large accidents, but I personally don't want to bet on that and I would want it to be transparent if tha...
Thanks for your comment.
I feel like a lot of the issues in this post are that the published RSPs are not very detailed and most of the work to flesh them out is not done.
I strongly disagree with this. In my opinion, a lot of the issue is that RSPs have been thought up from first principles without much consideration for everything the risk management field has done, and hence get things wrong without anyone noticing.
It's not a matter of how detailed they are; they get the broad principles wrong. As I argued (the entire table is about this) I think...
I'm not 100% sure about the second factor but the first is definitely a big factor. There's no institution which is more dense in STEM talent than ENS to my knowledge, and elites there are extremely generalist compared to equivalent elites I've met in other countries like the US (e.g. MIT). The core of "Classes Préparatoires" is that it pushes even the world's best people to grind like hell for 2 years, including weekends, every evening etc.
ENS is the result of: push all your elite to grind like crazy for 2 years on a range of STEM topics, and then select the top 20 to 50.