One thing that I do after social interactions, especially those which pertain to my work, is to go over all the updates my background processing is likely to make and to question them more explicitly.
This is helpful because I often notice that the updates I’m making aren’t related to reasons much at all. It’s more like “ah they kind of grimaced when I said that, so maybe I'm bad?” or like “they seemed just generally down on this approach, but wait are any of those reasons even new to me? Haven’t I already considered those and decided to do it anyway?” or “...
There are a ton of objective thresholds in here. E.g., for bioweapon acquisition evals "we pre-registered an 80% average score in the uplifted group as indicating ASL-3 capabilities" and for bioweapon knowledge evals "we consider the threshold reached if a well-elicited model (proxying for an "uplifted novice") matches or exceeds expert performance on more than 80% of questions (27/33)" (which seems good!).
I am confused, though, why these are not listed in the RSP, which is extremely vague. E.g., the "detailed capability threshold" in the RSP is "the...
I don't think that most people, upon learning that Anthropic's justification was "other companies were already putting everyone's lives at risk, so our relative contribution to the omnicide was low" would then want to abstain from rioting. Common ethical intuitions are often more deontological than that, more like "it's not okay to risk extinction, period." That Anthropic aims to reduce the risk of omnicide on the margin is not, I suspect, the point people would focus on if they truly grokked the stakes; I think they'd overwhelmingly focus on the threat to their lives that all AGI companies (including Anthropic) are imposing.
Technical progress also has the advantage of being the sort of thing which could make a superintelligence safe, whereas I expect very little of this to come from institutional competency alone.
I have grown pessimistic about our ability to solve the open technical problems even given 100 years of work on them.
Why?
I'm not particularly resolute on this question. But I get this sense when I look at (a) the best agent foundations work that's happened over ~10 years of work on the matter, and (b) the work output of scaling up the number of people working on 'alignment' by ~100x.
For the first, trying to get a better understanding of the basic concepts like logical induction and corrigibility and low-impact and ontological updates, while I feel like there's been progress (in timeless decision theory taking a clear step forward in figuring out how to think about decision-makers ...
I agree with a bunch of this post in spirit—that there are underlying patterns to alignment deserving of true name—although I disagree about… not the patterns you’re gesturing at, exactly, but more how you’re gesturing at them. Like, I agree there’s something important and real about “a chunk of the environment which looks like it’s been optimized for something,” and “a system which robustly makes the world look optimized for a certain objective, across many different contexts.” But I don't think the true names of alignment will be behaviorist (“as if” des...
...Many accounts of cognition are impossible (e.g., AIXI, VNM rationality, anything utilizing utility functions, and many AIT concepts), since they include the impossible step of considering all possible worlds. I think people normally consider this to be something like a “God’s eye view” of intelligence—ultimately correct, but incomputable—which can be projected down to us bounded creatures via approximation, but I think this is the wrong sort of in-principle-to-real-world bridge. Like, it seems to me that intelligence is fundamentally about ~“finding and exploi
By faithful and human-legible, I mean that the model’s reasoning is done in a way that is directly understandable by humans and accurately reflects the reasons for the model’s actions.
Curious why you say this, in particular the bit about “accurately reflects the reasons for the model’s actions.” How do you know that? (My impression is that this sort of judgment is often based on ~common sense/folk reasoning, i.e., that we judge this in a similar way to how we might judge whether another person took sensical actions—was the reasoning internally consiste...
I think this is a very important question and the answer should NOT be based on common-sense reasoning. My guess is that we could get evidence about the hidden reasoning capabilities of LLMs in a variety of ways, both from theoretical considerations (e.g., a refined version of the two-hop curse) and from extensive black box experiments (e.g., comparing performance on evals with and without CoT, or with modified CoT that changes the logic, and thus tests whether the model's internal reasoning aligns with the revealed reasoning).
These are all pretty basic thought...
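To make the black-box side of this concrete, here's a minimal sketch of the kind of comparison being described; all of the function names and the eval interface are hypothetical stand-ins rather than any lab's actual tooling:

```python
from typing import Callable, Optional, Sequence, Tuple

def cot_dependence_scores(
    answer_fn: Callable[[str, Optional[str]], str],  # (question, cot or None) -> final answer
    generate_cot: Callable[[str], str],              # question -> model-written chain of thought
    perturb_logic: Callable[[str], str],             # cot -> same cot with one logical step altered
    eval_set: Sequence[Tuple[str, str]],             # (question, gold answer) pairs
) -> dict:
    """Compare accuracy when answering with the model's own CoT, with no CoT,
    and with a logic-perturbed CoT."""
    counts = {"with_cot": 0, "no_cot": 0, "perturbed_cot": 0}
    for question, gold in eval_set:
        cot = generate_cot(question)
        counts["with_cot"] += int(answer_fn(question, cot) == gold)
        counts["no_cot"] += int(answer_fn(question, None) == gold)
        counts["perturbed_cot"] += int(answer_fn(question, perturb_logic(cot)) == gold)
    n = len(eval_set)
    return {condition: score / n for condition, score in counts.items()}
```

The signal is in the gaps: if accuracy barely moves when the CoT is removed or its logic is altered, that's evidence the stated reasoning isn't what's actually driving the answers.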
I purposefully use these terms vaguely since my concepts about them are in fact vague. E.g., when I say “alignment” I am referring to something roughly like “the AI wants what we want.” But what is “wanting,” and what does it mean for something far more powerful to conceptualize that wanting in a similar way, and what might wanting mean as a collective, and so on? All of these questions are very core to what it means for an AI system to be “aligned,” yet I don’t have satisfying or precise answers for any of them. So it seems more natural to me, at this sta...
I’m not convinced that the “hard parts” of alignment are difficult in the standardly difficult, g-requiring way that e.g., a physics post-doc might possess. I do think it takes an unusual skillset, though, which is where most of the trouble lives. I.e., I think the pre-paradigmatic skillset requires unusually strong epistemics (because you often need to track for yourself what makes sense), ~creativity (the ability to synthesize new concepts, to generate genuinely novel hypotheses/ideas), good ability to traverse levels of abstraction (connecting details t...
I’m not convinced that the “hard parts” of alignment are difficult in the standardly difficult, g-requiring way that e.g., a physics post-doc might possess.
To be clear, I wasn't talking about physics postdocs mainly because of raw g. Raw g is a necessary element, and physics postdocs are pretty heavily loaded on it, but I was talking about physics postdocs mostly because of the large volume of applied math tools they have.
The usual way that someone sees footholds on the hard parts of alignment is to have a broad enough technical background that they can se...
This is the sort of thing I find appealing to believe, but I feel at least somewhat skeptical of. I notice a strong emotional pull to want this to be true (as well as an interesting counterbalancing emotional pull for it to not be true).
I don't think I've seen output from people aspiring in this direction who aren't visibly quite smart that makes me think "okay yeah, it seems like it's on track in some sense."
I'd be interested in hearing more explicit cruxes from you about it.
I do think it's plausible that the "smart enough, creative enough, stron...
I think this is right. A couple of follow-on points:
There's a funding problem if this is an important route to progress. If good work is illegible for years, it's hard to decide who to fund, and hard to argue for people to fund it. I don't have a proposed solution, but I wanted to note this large problem.
Einstein did his pre-paradigmatic work largely alone. Better collaboration might've sped it up.
LessWrong allows people to share their thoughts prior to having publishable journal articles and get at least a few people to engage.
This makes the difficult pre...
I took John to be arguing that we won’t get a good solution out of this paradigm (so long as the humans doing it aren’t expert at alignment), rather than that we couldn’t recognize a good solution if it were proposed.
Separately, I think that recognizing good solutions is potentially pretty fraught, especially the more capable the system we’re outsourcing it to is. Like, anything about a proposed solution that we don’t know how to measure or don’t understand could be exploited, and it’s almost definitionally hard to tell that those failures exist. E.g., a pl...
Jan suggested a similar one (Baraka), but I was going to say Koyaanisqatsi. It’s one of my favorite films; I still feel deeply affected by it. I bring it up here, though, because it does an excellent job of inviting viewers to do original seeing. It’s basically a 90 minute documentary about the world, but it doesn’t feel like it has any agenda. It’s just shot after shot of what this planet is like (the Grand Canyon, a commute to work, a factory farm). It doesn’t shy away from anything, doesn’t feel like it’s grasping at some goal. Just an honest, gentle lo...
The interpretability safety case section seems to make the rather strong claim that the activation of known sabotage-related features is necessary for sabotage to occur (emphasis mine):
[C1.1] The model will be unable to accomplish organizational sabotage (T1) or to undermine oversight (T3) without triggering our feature-based monitoring (M1.1) and a detailed analysis of the session.
Afaict, the way that they establish this link (that a feature activation is necessary for a behavior to occur) is through ablation studies. Basically, if deactiva...
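For concreteness, here's a minimal sketch of the ablation-to-necessity inference pattern as I understand it; the names, interface, and threshold are hypothetical stand-ins, not Anthropic's actual setup:

```python
from typing import Callable, Iterable, List, Optional

def features_necessary_for_behavior(
    run_task: Callable[[Optional[str]], float],  # feature id to ablate (or None) -> behavior score
    candidate_features: Iterable[str],           # known sabotage-related feature ids
    behavior_threshold: float,                   # score above which the behavior counts as occurring
) -> List[str]:
    """Ablation-style inference: if zeroing out a feature prevents a behavior
    that otherwise occurs, treat that feature's activation as necessary for it."""
    baseline = run_task(None)  # run with no ablation
    if baseline < behavior_threshold:
        return []  # behavior doesn't occur even unablated; nothing to infer
    return [fid for fid in candidate_features if run_task(fid) < behavior_threshold]
```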
I think I probably agree, although I feel somewhat wary about it. My main hesitations are:
"It's plausible that things could go much faster than this, but as a prediction about what will actually happen, humanity as a whole probably doesn't want things to get incredibly crazy so fast, and so we're likely to see something tamer." I basically agree with that.
I feel confused about how this squares with Dario’s view that AI is "inevitable," and "driven by powerful market forces." Like, if humanity starts producing a technology which makes practically all aspects of life better, the idea is that this will just… stop? I’m sure some people wi...
Thanks, I think you’re right on both points—that the old RSP also didn’t require pre-specified evals, and that the section about Capability Reports just describes the process for non-threshold-triggering eval results—so I’ve retracted those parts of my comment; my apologies for the error. I’m on vacation right now so was trying to read quickly, but I should have checked more closely before commenting.
That said, it does seem to me like the “if/then” relationships in this RSP have been substantially weakened. The previous RSP contained sufficiently much wigg...
In the previous RSP, I had the sense that Anthropic was attempting to draw red lines—points at which, if models passed certain evaluations, Anthropic committed to pause and develop new safeguards. That is, if evaluations triggered, then they would implement safety measures. The “if” was already sketchy in the first RSP, as Anthropic was allowed to “determine whether the evaluation was overly conservative,” i.e., they were allowed to retroactively declare red lines green. Indeed, with such caveats it was difficult for me to see the RSP a...
Indeed, such red lines are now made more implicit and ambiguous. There are no longer predefined evaluations—instead, employees design and run them on the fly, and compile the resulting evidence into a Capability Report, which is sent to the CEO for review. A CEO who, to state the obvious, is hugely incentivized to decide to deploy models, since refraining from doing so might jeopardize the company.
This doesn't seem right to me, though it's possible that I'm misreading either the old or new policy (or both).
Re: predefined evaluations, the old policy nei...
Basically I just agree with what James said. But I think the steelman is something like: you should expect shorter (or no) pauses with an RSP if all goes well, because the precautions are matched to the risks. Like, the labs aim to develop safety measures which keep pace with the dangers introduced by scaling, and if they succeed at that, then they never have to pause. But even if they fail, they're also expecting that building frontier models will help them solve alignment faster. I.e., either way the overall pause time would probably be shorter?
It does s...
Does the category “working/interning on AI alignment/control” include safety roles at labs? I’d be curious to see that statistic separately, i.e., the percentage of MATS scholars who went on to work in any role at labs.
Similarly in Baba is You: when people don't have a crisp understanding of the puzzle, they tend to grasp at straws and motivatedly reason their way into accepting sketchy-sounding premises. But the true solution to a level often feels very crisp and clear and inevitable.
A few of the scientists I’ve read about have realized their big ideas in moments of insight (e.g., Darwin for natural selection, Einstein for special relativity). My current guess about what’s going on is something like: as you attempt to understand a concept you don’t already ...
I'm somewhat confused about when these evaluations are performed (i.e., how much safety training the model has undergone). OpenAI's paper says: "Red teamers had access to various snapshots of the model at different stages of training and mitigation maturity starting in early August through mid-September 2024," so it seems like this evaluation was probably performed several times. Were these results obtained only prior to safety training or after? The latter seems more concerning to me, so I'm curious.
...You actually want evaluators to have as much skin in the game as other employees so that when they take actions that might shut the company down or notably reduce the value of equity, this is a costly signal.
Further, it's good if evaluators are just considered normal employees and aren't separated out in any specific way. Then, other employees at the company will consider these evaluators to be part of their tribe and will feel some loyalty. (Also reducing the chance that risk evaluators feel like they are part of a rival "AI safety" tribe.) This probably ha
Thanks for writing this! I think it’s important for AI labs to write and share their strategic thoughts; I appreciate you doing so. I have many disagreements, but I think it’s great that the document is clear enough to disagree with.
You start the post by stating that “Our ability to do our safety work depends in large part on our access to frontier technology,” but you don’t say why. Like, there’s a sense in which much of this plan is predicated on Anthropic needing to stay at the frontier, but this document doesn’t explain why this is the right ...
I disagree. It would be one thing if Anthropic were advocating for AI to go slower, trying to get op-eds in the New York Times about how disastrous of a situation this was, or actually gaming out and detailing their hopes for how their influence will buy saving the world points if everything does become quite grim, and so on. But they aren’t doing that, and as far as I can tell they basically take all of the same actions as the other labs except with a slight bent towards safety.
Like, I don’t feel at all confident that Anthropic’s credit has exceeded their...
Some commentary (e.g. here) also highlighted (accurately) that the FSF doesn’t include commitments. This is because the science is in early stages and best practices will need to evolve.
I think it’s really alarming that this safety framework contains no commitments, and I’m frustrated that concern about this is brushed aside. If DeepMind is aiming to build AGI soon, I think it’s reasonable to expect they have a mitigation plan more concrete than a list of vague IOUs. For example, consider this description of a plan from the FSF:
...When a mo
I'm sympathetic to how this process might be exhausting, but at an institutional level I think Anthropic (and all labs) owe humanity a much clearer image of how they would approach a potentially serious and dangerous situation with their models. Especially so, given that the RSP is fairly silent on this point, leaving the response to evaluations up to the discretion of Anthropic. In other words, the reason I want to hear more from employees is in part because I don't know what the decision process inside of Anthropic will look like if an evaluation indicat...
I will also note what I feel is a somewhat concerning trend. It's happened many times now that I've critiqued something about Anthropic (its RSP, advocating to eliminate pre-harm from SB 1047, the silent reneging on the commitment to not push the frontier), and someone has said something to the effect of: "this wouldn't seem so bad if you knew what was happening behind the scenes."
I just wanted to +1 that I am also concerned about this trend, and I view it as one of the things that has pushed me (as well as many others in the community) to lose a l...
Largely agree with everything here.
But, I've heard some people be concerned "aren't basically all SSP-like plans basically fake? is this going to cement some random bureaucratic bullshit rather than actual good plans?" And yeah, that does seem plausible.
I do think that all SSP-like plans are basically fake, and I’m opposed to them becoming the bedrock of AI regulation. But I worry that people take the premise “the government will inevitably botch this” and conclude something like “so it’s best to let the labs figure out what to do before cemen...
Afaict, the current world we’re in is basically the worst case scenario
the status quo is not, imo, a remotely acceptable alternative either
Both of these quotes display types of thinking which are typically dangerous and counterproductive, because they rule out the possibility that your actions can make things worse.
The current world is very far from the worst-case scenario (even if you have very high P(doom), it's far away in log-odds) and I don't think it would be that hard to accidentally make things considerably worse.
Or independent thinkers try to find new frames because the ones on offer are insufficient? I think this is roughly what people mean when they say that AI is "pre-paradigmatic," i.e., we don't have the frames for filling to be very productive yet. Given that, I'm more sympathetic to framing posts on the margin than I am to filling ones, although I hope (and expect) that filling-type work will become more useful as we gain a better understanding of AI.
I think this letter is quite bad. If Anthropic were building frontier models for safety purposes, then they should be welcoming regulation. Because building AGI right now is reckless; it is only deemed responsible in light of its inevitability. Dario recently said “I think if [the effects of scaling] did stop, in some ways that would be good for the world. It would restrain everyone at the same time. But it’s not something we get to choose… It’s a fact of nature… We just get to find out which world we live in, and then deal with it as best we can.” Bu...
I agree that scoring "medium" seems like it would imply crossing into the medium zone, although I think what they actually mean is "at most medium." The full quote (from above) says:
In other words, if we reach (or are forecasted to reach) at least “high” pre-mitigation risk in any of the considered categories, we will not continue with deployment of that model (by the time we hit “high” pre-mitigation risk) until there are reasonably mitigations in place for the relevant postmitigation risk level to be back at most to “medium” level.
I.e., I think what they...
Maybe I'm missing the relevant bits, but afaict their preparedness doc says that they won't deploy a model if it passes the "medium" threshold, e.g.:
Only models with a post-mitigation score of "medium" or below can be deployed. In other words, if we reach (or are forecasted to reach) at least “high” pre-mitigation risk in any of the considered categories, we will not continue with deployment of that model (by the time we hit “high” pre-mitigation risk) until there are reasonably mitigations in place.
The threshold for further developing is set to "high," though. I.e., they can further develop so long as models don't hit the "critical" threshold.
But you could just as easily frame this as "abstract reasoning about unfamiliar domains is hard therefore you should distrust doom arguments".
But doesn't this argument hold with the opposite conclusion, too? E.g. "abstract reasoning about unfamiliar domains is hard therefore you should distrust arguments about good-by-default worlds".
But whether an organization can easily respond is pretty orthogonal to whether they’ve done something wrong. Like, if 80k is indeed doing something that merits a boycott, then saying so seems appropriate. There might be some debate about whether this is warranted given the facts, or even whether the facts are right, but it seems misguided to me to make the strength of an objection proportional to someone’s capacity to respond rather than to the badness of the thing they did.
This comment appears to respond to habryka, but doesn’t actually address what I took to be his two main points—that Anthropic was using NDAs to cover non-disparagement agreements, and that they were applying significant financial incentive to pressure employees into signing them.
We historically included standard non-disparagement agreements by default in severance agreements
Were these agreements subject to NDA? And were all departing employees asked to sign them, or just some? If the latter, what determined who was asked to sign?
Agreed. I'd be especially interested to hear this from people who have left Anthropic.
I do kind of share the sense that people mostly just want frisbee and tea, but I am still confused about it. Wasn't religion a huge deal for people for most of history? I could see a world where they were mostly just going through the motions, but the amount of awe I feel going into European churches feels like some evidence against this. And it's hard for me to imagine that people were kind of just mindlessly sitting there through e.g., Gregorian chanting in candlelight, but maybe I am typical minding too hard. It really seems like these rituals, the arch...
Right now it seems like the entire community is jumping to conclusions based on a couple of "impressions" people got from talking to Dario, plus an offhand line in a blog post. With that little evidence, if you have formed strong expectations, that's on you.
Like Robert, the impressions I had were based on what I heard from people working at Anthropic. I cited various bits of evidence because those were the ones available, not because they were the most representative. The most representative were those from Anthropic employees who concurred that this...
I have not heard from anyone who wasn’t released, and I think it is reasonably likely I would have heard from them anonymously on Signal. Also, not releasing a bunch of people after saying they would seems like an enormously unpopular, hard to keep secret, and not very advantageous move for OpenAI, which is already taking a lot of flak for this.
I’m not necessarily imagining that OpenAI failed to release a bunch of people, although that still seems possible to me. I’m more concerned that they haven’t released many key people, and while I agree that yo...
I imagine many of the people going into leadership positions were prepared to ignore the contract, or maybe even forgot about the nondisparagement clause
I could imagine it being the case that people are prepared to ignore the contract. But unless they publicly state that, it wouldn’t ameliorate my concerns—otherwise how is anyone supposed to trust they will?
...The clause is also open to more avenues of legal attack if it's enforced against someone who takes another position which requires disparagement (e.g. if it's argued to be a restriction on engaging in b
I agree this is usually the case, but I think it’s not always true, and I don’t think it’s necessarily true here. E.g., people as early as Da Vinci guessed that we’d be able to fly long before we had planes (or even any flying apparatus which worked). Because birds can fly, and so we should be able to as well (at least, this was Da Vinci and the Wright brothers' reasoning). That end point was not dependent on details (early flying designs had wings like a bird, a design which we did not keep :p), but was closer to a laws of physics claim (if birds can do i...
There's a pretty big difference between statements like "superintelligence is physically possible", "superintelligence could be dangerous" and statements like "doom is >80% likely in the 21st century unless we globally pause". I agree with (and am not objecting to) the former claims, but I don't agree with the latter claim.
I also agree that it's sometimes true that endpoints are easier to predict than intermediate points. I haven't seen Eliezer give a reasonable defense of this thesis as it applies to his doom model. If all he means here is that superin...
Bloomberg confirms that OpenAI has promised not to cancel vested equity under any circumstances, and to release all employees from one-directional non-disparagement agreements.
They don't actually say "all" and I haven't seen anyone confirm that all employees received this email. It seems possible (and perhaps likely) to me that many high profile safety people did not receive this email, especially since it would presumably be in Sam's interest to do so, and since I haven't seen them claiming otherwise. And we wouldn't know: those who are still under the co...
Why do you think this? The power that I'm primarily concerned about is the power to pause, and I'm quite skeptical that companies like Amazon and Google would be willing to invest billions of dollars in a company which may decide to do something that renders their investment worthless. I.e., I think a serious pause, one on the order of months or years, is essentially equivalent to opting out of the race to AGI. On this question, my strong prior is that investors like Google and Amazon have more power than employees or the trust, else they wouldn't invest.
"So God can’t make the atoms be arranged one way and the humans be arranged another contradictory way."
But couldn't he have made a different sort of thing than humans, which were less prone to evil? Like, it seems to me that he didn't need to make us evolve through the process of natural selection, such that species were always in competition, status was a big deal, fighting over mates commonplace, etc. I do expect that there's quite a bit of convergence in the space of possible minds—even if one is selecting them from the set of "all possible atomic confi...
Secondly, following Dennett, the point of modeling cognitive systems according to the intentional stance is that we evaluate them on a behavioral basis and that is all there is to evaluate.
I am confused on this point. Several people have stated that Dennett believes something like this, e.g., Quintin and Nora argue that Dennett is a goal "reductionist," by which I think they mean something like "goal is the word we use to refer to certain patterns of behavior, but it's not more fundamental than that."
But I don't think Dennett believes this. He's ...
I don't know what Katja thinks, but for me at least: I think AI might pose much more lock-in than other technologies. I.e., I expect that we'll have much less of a chance (and perhaps much less time) to redirect course, adapt, learn from trial and error, etc. than we typically do with a new technology. Given this, I think going slower and aiming to get it right on the first try is much more important than it normally is.
I agree there are other problems the EA biosecurity community focuses on, but surely lab escapes are one of those problems, and part of the reason we need biosecurity measures? In any case, this disagreement seems beside the main point that I took Adam to be making, namely that the track record for defining appropriate units of risk for poorly understood, high attack surface domains is quite bad (as with BSL). This still seems true to me.
Dennett meant a lot to me, in part because he’s shaped my thinking so much, and in part because I think we share a kindred spirit—this ardent curiosity about minds and how they might come to exist in a world like ours. I also think he is an unusually skilled thinker and writer in many respects, as well as being an exceptionally delightful human. I miss him.
In particular, I found his deep and persistent curiosity beautiful and inspiring, especially since it’s aimed at all the (imo) important questions. He has a clarity of thought which manages to be b...
Aw man, this is so exciting! There’s something really important to me about rationalist virtues having a home in the world. I’m not sure if what I’m imagining is what you’re proposing, exactly, but I think most anything in this vicinity would feel like a huge world upgrade to me.
Apparently I have a lot of thoughts about this. Here are some of them, not sure how applicable they are to this project in particular. I think you can consider this to be my hopes for what such a thing might be like, which I suspect shares some overlap.
It has felt to me for a few y...
Huh, I feel confused. I suppose we just have different impressions. Like, I would say that Oliver is exceedingly good at cutting through the bullshit. E.g., I consider his reasoning around shutting down the Lightcone offices to be of this type, in that it felt like a very straightforward document of important considerations, some of which I imagine were socially and/or politically costly to make. One way to say that is that I think Oliver is very high integrity, and I think this helps with bullshit detection: it's easier to see how things don't cut to the ...
This is very cool! I’m excited to see where it goes :)
A couple questions (mostly me grappling with what the implications of this work might be):
Thanks!
My high-level skepticism of their approach is A) I don't buy that it's possible yet to know how dangerous models are, nor that it is likely to become possible in time to make reasonable decisions, and B) I don't buy that Anthropic would actually pause, except under a pretty narrow set of conditions which seem unlikely to occur.
As to the first point: Anthropic's strategy seems to involve Anthropic somehow knowing when to pause, yet as far as I can tell, they don't actually know how they'll know that. Their scaling policy does not list the tests they'll run,...