When do you think would be a good time to lock in regulation? I personally doubt RSP-style regulation would even help, but the notion that now is too soon/risks locking in early sketches, strikes me as in some tension with e.g. Anthropic trying to automate AI research ASAP, Dario expecting ASL-4 systems between 2025—the current year!—and 2028, etc.
Give me your model, with numbers, that shows supporting Anthropic to be a bad bet, or admit you are confused and that you don't actually have good advice to give anyone.
It seems to me that other possibilities exist, besides "has model with numbers" or "confused." For example, that there are relevant ethical considerations here which are hard to crisply, quantitatively operationalize!
One such consideration which feels especially salient to me is the heuristic that before doing things, one should ideally try to imagine how people would react, upon learning w...
Does your model predict literal worldwide riots against the creators of nuclear weapons? They posed a single-digit risk of killing everyone on Earth (total, not yearly).
It would be interesting to live in a world where people reacted with scale sensitivity to extinction risks, but that's not this world.
The only safety techniques that count are the ones that actually get deployed in time.
True, but note this doesn't necessarily imply trying to maximize your impact in the mean timelines world! Alignment plans vary hugely in potential usefulness, so I think it can pretty easily be the case that your highest EV bet would only pay off in a minority of possible futures.
Prelude to Power is my favorite depiction of scientific discovery. Unlike any other such film I've seen, it adequately demonstrates the inquiry from the perspective of the inquirer, rather than from conceptual or biographical retrospect.
I'm curious if "trusted" in this sense basically just means "aligned"—or like, the superset of that which also includes "unaligned yet too dumb to cause harm" and "unaligned yet prevented from causing harm"—or whether you mean something more specific? E.g., are you imagining that some powerful unconstrained systems are trusted yet unaligned, or vice versa?
I would guess it does somewhat exacerbate risk. I think it's unlikely (~15%) that alignment is easy enough that prosaic techniques even could suffice, but in those worlds I expect things go well mostly because the behavior of powerful models is non-trivially influenced/constrained by their training. In which case I do expect there's more room for things to go wrong, the more that training is for lethality/adversariality.
Given the state of atheoretical confusion about alignment, I feel wary of confidently dismissing these sorts of basic, obvious-at-first-gl...
It seems the pro-Trump Polymarket whale may have had a real edge after all. Wall Street Journal reports (paywalled link, screenshot) that he’s a former professional trader, who commissioned his own polls from a major polling firm using an alternate methodology—the neighbor method, i.e. asking respondents who they expect their neighbors will vote for—he thought would be less biased by preference falsification.
I didn't bet against him, though I strongly considered it; feeling glad this morning that I didn't.
On one hand, I feel a bit skeptical that some dude outperformed approximately every other pollster and analyst by having a correct inside-view belief about how existing pollsters were messing up, especially given that he won't share the surveys. On the other hand, this sort of result is straightforwardly predicted by Inadequate Equilibria, where an entire industry had the affordance to be arbitrarily deficient in what most people would think was their primary value-add, because they had no incentive to accuracy (skin in the game), and as soon as someo...
I don't remember anyone proposing "maybe this trader has an edge", even though incentivising such people to trade is the mechanism by which prediction markets work. Certainly I didn't, and in retrospect it feels like a failure not to have had 'the multi-million dollar trader might be smart money' as a hypothesis at all.
Knowing now that he had an edge, I feel like his execution strategy was suspect. The Polymarket prices went from 66c during the order back to 57c in the 5 days before the election. He could have extracted a bit more money from the market if he had forecasted the volume correctly and traded against it proportionally.
I don't know much about religion, but my impression is the Pope disagrees with your interpretation of Catholic doctrine, which seems like strong counterevidence. For example, see this quote:
“All religions are paths to God. I will use an analogy, they are like different languages that express the divine. But God is for everyone, and therefore, we are all God’s children.... There is only one God, and religions are like languages, paths to reach God. Some Sikh, some Muslim, some Hindu, some Christian.”
And this one:
...The pluralism and the diversity of religions,
Huh, this doesn't seem clear to me. It's tricky to debate what people used to be imagining, especially on topics where those people were talking past each other this much, but my impression was that the fast/discontinuous argument was that rapid, human-mostly-or-entirely-out-of-the-loop recursive self-improvement seemed plausible—not that earlier, non-self-improving systems wouldn't be useful.
Given both my personal experience with LLMs and my reading of the role that empirical engagement has historically played in non-paradigmatic research, I tend to advocate for a methodology which incorporates immediate feedback loops with present day deep learning systems over the classical "philosophy -> math -> engineering" deconfusion/agent foundations paradigm.
I'm curious what your read of the history is, here? My impression is that most important paradigm-forming work so far has involved empirical feedback somehow, but often in ways exceedingly di...
For what it's worth, as someone in basically the position you describe—I struggle to imagine automated alignment working, mostly because of Godzilla-ish concerns—demos like these do not strike me as cruxy. I'm not sure what the cruxes are, exactly, but I'm guessing they're more about things like e.g. relative enthusiasm about prosaic alignment, relative likelihood of sharp left turn-type problems, etc., than about whether early automated demos are likely to work on early systems.
Maybe you want to call these concerns unserious too, but regardless I do think...
I sympathize with the annoyance, but I think the response from the broader safety crowd (e.g., your Manifold market, substantive critiques and general ill-reception on LessWrong) has actually been pretty healthy overall; I think it's rare that peer review or other forms of community assessment work as well or quickly.
Hendrycks had ample opportunity after initial skepticism to remove it, but chose not to.
IMO, this seems to demand a very immediate/sudden/urgent reaction. If Hendrycks ends up being wrong, I think he should issue some sort of retraction (and I think it would be reasonable to be annoyed if he doesn't.)
But I don't think the standard should be "you need to react to criticism within ~24 hours" for this kind of thing. If you write a research paper and people raise important concerns about it, I think you have a duty to investigate them and respond to them, but I...
It's not a full conceptual history, but fwiw Boole does give a decent account of his own process and frustrations in the preface and first chapter of his book.
I just meant there are many teams racing to build more agentic models. I agree current ones aren't very agentic, though whether that's because they're meaningfully more like "tools" or just still too stupid to do agency well or something else entirely, feels like an open question to me; I think our language here (like our understanding) remains confused and ill-defined.
I do think current systems are very unlike oracles though, in that they have far more opportunity to exert influence than the prototypical imagined oracle design—e.g., most have I/O with ~any browser (or human) anywhere, people are actively experimenting with hooking them up to robotic effectors, etc.
I liked Thermodynamic Weirdness for similar reasons. It does the best job of any book I've found at describing case studies of conceptual progress—i.e., what the initial prevailing conceptualizations were, and how/why scientists realized they could be improved.
It's rare that books describe such processes well, I suspect partly because it's so wildly harder to generate scientific ideas than to understand them, that they tend to strike people as almost blindingly obvious in retrospect. For example, I think it's often pretty difficult for people familiar with ev...
It's rare that books describe such processes well, I suspect partly because it's so wildly harder to generate scientific ideas than to understand them, that they tend to strike people as almost blindingly obvious in retrospect.
Completely agreed!
I think this is also what makes great history of science so hard: you need to unlearn most of the modern insights and intuitions that didn't exist at the time, and see as close as possible to what the historical actors saw.
This makes me think of a great quote from World of Flows, a history of hydrodynamics:
...There is,
This seems like a great activity, thank you for doing/sharing it. I disagree with the claim near the end that this seems better than Stop, and in general felt somewhat alarmed throughout at (what seemed to me like) some conflation/conceptual slippage between arguments that various strategies were tractable, and that they were meaningfully helpful. Even so, I feel happy that the world contains people sharing things like this; props.
I disagree with the claim near the end that this seems better than Stop
At the start of the doc, I say:
It’s plausible that the optimal approach for the AI lab is to delay training the model and wait for additional safety progress. However, we’ll assume the situation is roughly: there is a large amount of institutional will to implement this plan, but we can only tolerate so much delay. In practice, it’s unclear if there will be sufficient institutional will to faithfully implement this proposal.
Towards the end of the doc I say:
...This plan requires qu
I think the latter group is much smaller. I'm not sure who exactly has most influence over risk evaluation, but the most obvious examples are company leadership and safety staff/red-teamers. From what I hear, even those currently receive equity (which seems corroborated by job listings, e.g. Anthropic, DeepMind, OpenAI).
What seemed psychologizing/unfair to you, Raemon? I think it was probably unnecessarily rude/a mistake to try to summarize Anthropic’s whole RSP in a sentence, given that the inferential distance here is obviously large. But I do think the sentence was fair.
As I understand it, Anthropic’s plan for detecting threats is mostly based on red-teaming (i.e., asking the models to do things to gain evidence about whether they can). But nobody understands the models well enough to check for the actual concerning properties themselves, so red teamers instead check f...
...My guess is that most don’t do this much in public or on the internet, because it’s absolutely exhausting, and if you say something misremembered or misinterpreted you’re treated as a liar, it’ll be taken out of context either way, and you probably can’t make corrections. I keep doing it anyway because I occasionally find useful perspectives or insights this way, and think it’s important to share mine. That said, there’s a loud minority which makes the AI-safety-adjacent community by far the most hostile and least charitable environment I spend
Open Philanthropy commissioned five case studies of this sort, which ended up being written by Moritz von Knebel; as far as I know they haven't been published, but plausibly someone could convince him to.
They have in fact been published (it's in your link), at least the ones authors agreed to make publicly available: these are all the case studies, and Moritz von Knebel's write-ups are
Those are great examples, thanks; I can totally believe there exist many such problems.
Still, I do really appreciate ~never having to worry that food from grocery stores or restaurants will acutely poison me; and similarly, not having to worry that much that pharmaceuticals are adulterated/contaminated. So overall I think I currently feel net grateful about the FDA’s purity standards, and net hateful just about their efficacy standards?
What countries are you imagining? I know some countries have more street food, but from what I anecdotally hear most also have far more food poisoning/contamination issues. I'm not sure what the optimal tradeoff here looks like, and I could easily believe it's closer to the norms in e.g. Southeast Asia than the U.S. But it at least feels much less obvious to me than that drug regulations are overzealous.
(Also note that much regulation of things like food trucks is done by cities/states, not the FDA).
Arguments criticizing the FDA often seem to weirdly ignore the "F." For all I know food safety regulations are radically overzealous too, but if so I've never noticed (or heard a case for) this causing notable harm.
Overall, my experience as a food consumer seems decent—food is cheap, and essentially never harms me in ways I expect regulators could feasibly prevent (e.g., by giving me food poisoning, heavy metal poisoning, etc). I think there may be harmful contaminants in food we haven't discovered yet, but if so I mostly don't blame the FDA for that lack of knowledge, and insofar as I do it seems an argument they're being under-zealous.
Criticizing FDA food regulations is a niche; it is hard to criticize 'the unseen', especially when it's mostly about pleasure and the FDA is crying: 'we're saving lives! Won't someone think of the children? How can you disagree, just to stuff your face? Shouldn't you be on a diet anyway?'
But if you go looking, you'll find tons of it: pasteurized cheese and milk being a major flashpoint, as apparently the original unpasteurized versions are a lot tastier. (I'm reminded of things like beef tallow for fries or Chipotle - how do you know how good McDonald's...
I agree it seems good to minimize total risk, even when the best available actions are awful; I think my reservation is mainly that in most such cases, it seems really important to say you're in that position, so others don't mistakenly conclude you have things handled. And I model AGI companies as being quite disincentivized from admitting this already—and humans generally as being unreasonably disinclined to update that weird things are happening—so I feel wary of frames/language that emphasize local relative tradeoffs, thereby making it even easier to conceal the absolute level of danger.
- *The rushed reasonable developer regime.* The much riskier regimes I expect, where even relatively reasonable AI developers are in a huge rush and so are much less able to implement interventions carefully or to err on the side of caution.
I object to the use of the word "reasonable" here, for similar reasons I object to Anthropic's use of the word "responsible." Like, obviously it could be the case that e.g. it's simply intractable to substantially reduce the risk of disaster, and so the best available move is marginal triage; this isn't my guess, but I do...
It sounds like you think it's reasonably likely we'll end up in a world with rogue AI close enough in power to humanity/states to be competitive in war, yet not powerful enough to quickly/decisively win? If so I'm curious why; this seems like a pretty unlikely/unstable equilibrium to me, given how much easier it is to improve AI systems than humans.
*The existential war regime*. You’re in an existential war with an enemy and you’re indifferent to AI takeover vs the enemy defeating you. This might happen if you’re in a war with a nation you don’t like much, or if you’re at war with AIs.
Does this seem likely to you, or just an interesting edge case or similar? It's hard for me to imagine realistic-seeming scenarios where e.g. the United States ends up in a war where losing would be comparably bad to AI takeover. This is mostly because ~no functional states (certainly no great powers) strike me as so evi...
We should generally have a strong prior favoring technology in general
Should we? I think it's much more obvious that the increase in human welfare so far has mostly been caused by technology, than that most technologies have net helped humans (much less organisms generally).
I'm quite grateful for agriculture now, but unsure I would have been during the Bronze Age; grateful for nuclear weapons, but unsure how many nearby worlds I'd feel similarly; net bummed about machine guns, etc.
I agree music has this effect, but I think the Fence is mostly because it also hugely influences the mood of the gathering, i.e. of the type and correlatedness of people's emotional states.
(Music also has some costs, although I think most of these aren't actually due to the music itself and can be avoided with proper acoustical treatment. E.g. people sometimes perceive music as too loud because the emitted volume is literally too high, but ime people often say this when the noise is actually overwhelming for other reasons, like echo (insofar as walls/floor...
I appreciate you adding the note, though I do think the situation is far more unusual than described. I agree it's widely priced in that companies in general seek power, but I think probably less so that the author of this post personally works for a company which is attempting to acquire drastically more power than any other company ever, and that much of the behavior the post describes as power-seeking amounts to "people trying to stop the author and his colleagues from attempting that."
Yeah, this omission felt pretty glaring to me. OpenAI is explicitly aiming to build "the most powerful technology humanity has yet invented." Obviously that doesn't mean Richard is wrong that the AI safety community is too power-seeking, but I would sure have appreciated him acknowledging/grappling with the fact that the company he works for is seeking to obtain more power than any group of people in history by a gigantic margin.
An elephant in the room (IMO) is that moving forward, OpenAI probably benefits from a world in which the AI safety community does not have much influence.
There's a fine line between "play nice with others and be more cooperative" and "don't actually advocate for policies that you think would help the world, and only do things that the Big Companies and Their Allies are comfortable with."
Again, I don't think Richard sat in his room and thought "how do I spread a meme that is good for my company." I think he's genuinely saying what he believes and givi...
I agree we might end up in a world like that, where it proves impossible to make a decent safety case. I just think of the ~whole goal of alignment research as figuring out how to avoid that world, i.e. of figuring out how to mitigate/estimate the risk as much/precisely as needed to make TAI worth building.
Currently, AI risk estimates are mostly just verbal statements like "I don't know man, probably some double digit chance of extinction." This is exceedingly unlike the sort of predictably tolerable risk humanity normally expects from its engineering proj...
Maybe I'm just confused what you mean by those words, but where is the disanalogy with safety engineering coming from? That normally safety engineering focuses on mitigating risks with complex causes, whereas AI risk is caused by some sort of scaffolding/bureaucracy which is simpler?
I'm still confused what sort of simplicity you're imagining? From my perspective, the type of complexity which determines the size of the fail surface for alignment mostly stems from things like e.g. "degree of goal stability," "relative detectability of ill intent," and other such things that seem far more complicated than airplane parts.
I agree we don’t currently know how to prevent AI systems from becoming adversarial, and that until we do it seems hard to make strong safety cases for them. But I think this inability is a skill issue, not an inherent property of the domain, and traditionally the core aim of alignment research was to gain this skill.
Plausibly we don’t have enough time to figure out how to gain as much confidence that transformative AI systems are safe as we typically have about e.g. single airplanes, but in my view that’s horrifying, and I think it’s useful to notice how different this situation is from the sort humanity is typically willing to accept.
Thanks, that's helpful context.
I also have a model of how people choose whether or not to make public statements where it’s extremely unsurprising most people would not choose to do so.
I agree it's unsurprising that few rank-and-file employees would make statements, but I am surprised by the silence from those in policy/evals roles. From my perspective, active non-disparagement obligations seem clearly disqualifying for most such roles, so I'd think they'd want to clarify.
I am quite confident the contract has been widely retracted.
Can you share your reasons for thinking this? Given that people who remain bound can’t say so, I feel hesitant to conclude that people aren’t without clear evidence.
I am unaware of any people who signed the agreement after 2019 and did not receive the email, outside cases where the nondisparagement agreement was mutual (which includes Sutskever and likely also Anthropic leadership).
Excepting Jack Clark (who works for Anthropic) and Remco Zwetsloot (who left in 2018), I would think all the po...
I have been in touch with around a half dozen former OpenAI employees who I spoke to before former employees were released and all of them later informed me they were released, and they were not in any identifiable reference class such that I’d expect OpenAI would have been able to selectively release them while not releasing most people. I have further been in touch with many other former employees since they were released who confirmed this. I have not heard from anyone who wasn’t released, and I think it is reasonably likely I would have heard from the...
Yeah, the proposal here differs from warrant canaries in that it doesn't ask people to proactively make statements ahead of time—it just relies on the ability of some people who can speak, to provide evidence that others can't. So if e.g. Bob and Joe have been released, but Alice hasn't, then Bob and Joe saying they've been released makes Alice's silence more conspicuous.
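To make the "more conspicuous" intuition concrete, here's a toy Bayes' rule sketch. Every number in it is made up for illustration, and the function and parameter names are hypothetical. The only point is directional: the more normal it becomes for released people to say so, the more Alice's silence shifts probability toward her still being bound.

```python
def posterior_still_bound(prior_bound: float,
                          p_silent_if_bound: float,
                          p_silent_if_released: float) -> float:
    """P(still bound | silent), by Bayes' rule.

    All inputs are illustrative assumptions, not estimates of any real case.
    """
    # Probability mass on "bound and silent".
    bound_and_silent = prior_bound * p_silent_if_bound
    # Probability mass on "released and silent".
    released_and_silent = (1 - prior_bound) * p_silent_if_released
    return bound_and_silent / (bound_and_silent + released_and_silent)


# Hypothetical prior that Alice is still bound, before anyone speaks up.
PRIOR = 0.2

# Scenario A: almost nobody who was released bothers to say so,
# so silence is nearly as likely whether or not Alice was released.
few_speak = posterior_still_bound(PRIOR, p_silent_if_bound=1.0,
                                  p_silent_if_released=0.9)

# Scenario B: Bob, Joe, and most other released people say so publicly,
# so a released Alice would be much less likely to stay silent.
many_speak = posterior_still_bound(PRIOR, p_silent_if_bound=1.0,
                                   p_silent_if_released=0.3)

print(f"few speak:  {few_speak:.3f}")
print(f"many speak: {many_speak:.3f}")
```

The prior and the likelihoods are the whole game here, and I don't claim to know them; the sketch only shows why each additional public statement makes the remaining silence more informative.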
the post appears to wildly misinterpret the meaning of this term as "taking any actions which might make the company less valuable"
I'm not a lawyer, and I may be misinterpreting the non-interference provision—certainly I'm willing to update the post if so! But upon further googling, my current understanding is still that in contracts, "interference" typically means "anything that disrupts, damages or impairs business."
And the provision in the OpenAI offboarding agreement is written so broadly—"Employee agrees not to interfere with OpenAI’s relationship wit...
(This is Kelsey Piper). I am quite confident the contract has been widely retracted. The overwhelming majority of people who received an email did not make an immediate public comment. I am unaware of any people who signed the agreement after 2019 and did not receive the email, outside cases where the nondisparagement agreement was mutual (which includes Sutskever and likely also Anthropic leadership). In every case I am aware of, people who signed before 2019 did not reliably receive an email but were reliably able to get released if they emailed OpenAI HR.
If you signed such an agreement and have not been released, you can of course contact me on Signal: 303 261 2769.
See the statement from OpenAI in this article:
We're removing nondisparagement clauses from our standard departure paperwork, and we're releasing former employees from existing nondisparagement obligations unless the nondisparagement provision was mutual. We'll communicate this message to former employees.
They have communicated this to me and I believe I was in the same category as most former employees.
I think the main reasons so few people have mentioned this are:
I agree, but I think it still matters whether or not he's bound by the actual agreement. One might imagine that he's carefully pushing the edge of what he thinks he can get away with saying, for example, in which case he may still not be fully free to speak his mind. And since I would much prefer to live in a world where he is, I'm wary of prematurely concluding otherwise without clear evidence.
I haven't perceived the degree of focus as intense, and if I had I might be tempted to level similar criticism. But I think current people/companies do clearly matter some, so warrant some focus. For example:
- I think it's plausible that governments will be inclined to regulate AI companies more like "tech startups" than "private citizens building WMDs," the more those companies strike them as "responsible," earnestly trying their best, etc. In which case, it seems plausibly helpful to propagate information about how hard they are in fact trying, and how goo