Thanks for writing! I agree the factors this post describes make some types of gradient hacking extremely difficult, but I don't see how they make the following approach to gradient hacking extremely difficult.
...Suppose that an agent has some trait which gradient descent is trying to push in direction x because the x-ness of that trait contributes to the agent’s high score; and that the agent wants to use gradient hacking to prevent this. Consider three possible strategies that the agent might try to implement, upon noticing that the x-component of the tra
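As a concrete toy picture of the setup being quoted (my own minimal sketch, not taken from the post, and deliberately simpler than the strategies the quote goes on to discuss): gradient descent pushes a trait parameter toward x because that improves the score, and the hacking agent couples an extra self-imposed penalty to the trait so that the trait's current value becomes a local minimum of the overall loss.

```python
# A toy sketch (my own illustration, not from the post): gradient descent
# "wants" to push a trait parameter t toward a target x, because task loss
# shrinks as t approaches x. A gradient-hacking agent couples an extra,
# self-imposed loss term to the trait so that its current value t_anchor
# becomes a local minimum, and the updates stop moving the trait.

x = 1.0          # value that gradient descent is pushing the trait toward
t = 0.2          # current value of the trait
t_anchor = 0.2   # value the agent wants the trait to keep
lr = 0.1         # learning rate

def task_grad(t):
    # Gradient of the ordinary task loss 0.5 * (t - x)**2: pressure toward x.
    return t - x

def hacked_grad(t):
    # The agent adds a penalty that is linear in t with slope (x - t_anchor),
    # so the total gradient becomes (t - x) + (x - t_anchor) = t - t_anchor,
    # which vanishes at t_anchor and pushes t back toward it otherwise.
    return task_grad(t) + (x - t_anchor)

for _ in range(100):
    t -= lr * hacked_grad(t)

print(round(t, 3))  # ~0.2: the trait stays put instead of moving toward 1.0
```

The interesting question, of course, is whether the agent could actually implement and protect a coupling like this; the penalty's slope is chosen precisely so the gradients cancel at the anchored value.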
Thanks for posting, but I think these arguments have major oversights, which leave me more optimistic about the extent to which people will avoid and prevent the horrible misuse you describe.
First, this post seems to overstate the extent to which people tend to value and carry out extreme torture. Maximally cruel torture fortunately seems very rare.
Thanks for writing!
I want to push back a bit on the framing used here. Instead of the framing "slowing down AI," another framing we could use is "lay the groundwork for slowing down in the future, when extra time is most needed." I prefer this latter framing/emphasis because:
Work to spread good knowledge regarding AGI risk / doom stuff among politicians, the general public, etc. [...] Emphasizing “there is a big problem, and more safety research is desperately needed” seems good and is I think uncontroversial.
Nitpick: My impression is that at least some versions of this outreach are very controversial in the community, as suggested by e.g. the lack of mass advocacy efforts. [Edit: "lack of" was an overstatement. But these are still much smaller than they could be.]
It does, thanks! (I had interpreted the claim in the paper as comparing e.g. TPUs to CPUs, since the quote mentions CPUs as the baseline.)
Thanks! To make sure I'm following, does optimization help just by improving utilization?
Sorry, I'm a bit confused. I'm interpreting the 1st and 3rd paragraphs of your response as expressing opposite opinions about the claimed efficiency gains (uncertainty and confidence, respectively), so I think I'm probably misinterpreting part of it?
This is helpful for something I've been working on - thanks!
I was initially confused about how these results could fit with claims from this paper on AI chips, which emphasizes the importance of factors other than transistor density for AI-specialized chips' performance. But on second thought, the claims seem compatible:
One specific concern people could have with this thoughtspace is that it's hard to square with the knowledge that an AI PhD [edit: or rather, AI/ML expertise more broadly] provides. I took this point to be strongly suggested by the author's claims that "experts knowledgeable in the relevant subject matters that would actually lead to doom find this laughable" and that someone who spent their early years "reading/studying deep learning, systems neuroscience, etc." would not find risk arguments compelling. That's directly refuted by the surv...
experts knowledgeable in the relevant subject matters that would actually lead to doom find this laughable
This seems overstated; plenty of AI/ML experts are concerned. [1] [2] [3] [4] [5] [6] [7] [8] [9]
Quoting from [1], a survey of researchers who published at top ML conferences:
The median respondent’s probability of x-risk from humans failing to control AI was 10%
Admittedly, that's a far cry from "the light cone is about to get ripped to shreds," but it's also pretty far from finding those concerns laughable. [Edited to add: another recent survey ...
Yep! Here's a compilation.
If someone's been following along with popular LW posts on alignment and is new to governance, I'd expect them to find the "core readings" in "weeks" 4-6 most relevant.
I'm sympathetic under some interpretations of "a ton of time," but I think it's still worth people's time to spend at least ~10 hours of reading and ~10 hours of conversation getting caught up with AI governance/strategy thinking, if they want to contribute.
Arguments for this:
more researchers should backchain from “how do I make AGI timelines longer
Like you mention, "end time" seems (much) more valuable than earlier time. But the framing here, as well as the broader framing of "buying time," collapses that distinction (by just using "time" as the metric). So I'd suggest more heavily emphasizing buying end time.
One potential response is: it doesn't matter; both framings suggest the same interventions. But that seems wrong. For example, slowing down AI progress now seems like it'd mostly buy "pre-end time" (potentially by burn...
Thanks for posting this!
There's a lot here I agree with (which might not be a surprise). Since the example interventions are all/mostly technical research or outreach to technical researchers, I'd add that a bunch of more "governance-flavored" interventions would also potentially contribute.
I agree with a lot of that. Still, if
nuclear non proliferation [to the extent that it has been achieved] is probably harder than a ban on gain-of-function
that's sufficient to prove Daniel's original criticism of the OP--that governments can [probably] fail at something yet succeed at some harder thing.
(And on a tangent, I'd guess a salient warning shot--which the OP was conditioning on--would give the US + China strong incentives to discourage risky AI stuff.)
I agree it's some evidence, but that's a much weaker claim than "probably policy can't deliver the wins we need."
An earlier comment seems to make a good case that there's already more community investment in AI policy, and another earlier thread points out that the content in brackets doesn't seem to involve a good model of policy tractability.
- Perhaps the sorts of government interventions needed to make AI go well are not all that large, and not that precise.
I confess I don't really understand this view.
Specifically for the sub-claim that "literal global cooperation" is unnecessary, I think a common element of people's views is that the semiconductor supply chain has chokepoints in a few countries, so action from just these few governments can shape what is done with AI everywhere (within a certain window of time).
I'd guess the very slow rate of nuclear proliferation has been much harder to achieve than banning gain-of-function research would be, since, in the absence of intervention, incentives to get nukes would have been much bigger than incentives to do gain-of-function research.
Also, on top of the taboo against chemical weapons, there was the verified destruction of most chemical weapons globally.
Thanks for the post - I think there are some ways heavy regulation of AI could be very counterproductive or ineffective for safety:
My problem is that most of the scenarios I see being discussed are dependent on a long chain of assumptions being true and they often seem to ignore that many things could go wrong, invalidating the full thing: you don't need to be wrong in all those steps, one of them is just enough.
This feels a bit like it might be shifting the goalposts; it seemed like your previous comment was criticizing a specific argumentative step ("reasons not to believe in doom: [...] Orthogonality of intelligence and agency"), rather than just pointing out that there were m...
- Orthogonality of intelligence and agency. I can envision a machine with high intelligence and zero agency, I haven't seen any convincing argument yet of why both things must necessarily go together
Hm, what do you make of the following argument? Even assuming (contestably) that intelligence and agency don't in principle need to go together, in practice they'll go together because there will appear to be strong economic or geopolitical incentives to build systems that are both highly intelligent and highly agentic (e.g., AI systems that can run teams). ...
So we need a way to have alignment deployed throughout the algorithmic world before anyone develops AGI. To do this, we'll start by offering alignment as a service for more limited AIs.
I'm tentatively fairly excited about some version of this, so I'll suggest some tweaks that can hopefully be helpful for your success (or for the brainstorming of anyone else who's thinking about doing something similar in the future).
We will refine and develop this deployment plan, depending on research results, commercial opportunities, feedback, and suggestions.
I s...
A more recent clarification from Paul Christiano, on how Part 1 might get locked in / how it relates to concerns about misaligned, power-seeking AI:
I also consider catastrophic versions of "you get what you measure" to be a subset/framing/whatever of "misaligned power-seeking." I think misaligned power-seeking is the main way the problem is locked in.
I'm still pretty confused by "You get what you measure" being framed as a distinct threat model from power-seeking AI (rather than as another sub-threat model). I'll try to address two defenses of framing them as distinct threat models, which I interpret this post as suggesting (in the context of this earlier comment on the overview post). Broadly, I'll be arguing that power-seeking AI is necessary for "you get what you measure" issues to pose existential threats, so "you get what you measure" concerns are best thought of as a sub-threat model of ...
I'm still pretty confused by "You get what you measure" being framed as a distinct threat model from power-seeking AI (rather than as another sub-threat model)
I also consider catastrophic versions of "you get what you measure" to be a subset/framing/whatever of "misaligned power-seeking." I think misaligned power-seeking is the main way the problem is locked in.
To a lesser extent, "you get what you measure" may also be an obstacle to using AI systems to help us navigate complex challenges without quick feedback, like improving governance. But I don't think...
In my model, one should be deeply skeptical whenever the answer to ‘what would do the most good?’ is ‘get people like me more money and/or access to power.’ One should be only somewhat less skeptical when the answer is ‘make there be more people like me’ or ‘build and fund a community of people like me.’ [...] I wish I had a better way to communicate what I find so deeply wrong here
I'd be very curious to hear more fleshed-out arguments here, if you or others think of them. My best guess about what you have in mind is that it's a combination of the follo...
This is a good first attempt and it is directionally correct as to what my concerns are.
The big difference is something like your apparent instinct that these problems are practical and avoidable, limited in scope and only serious if you go 'all-in' on power or are 'doing it wrong' in some sense.
Whereas my model says that these problems are unavoidable even under the best of circumstances and at best you can mitigate them, the scope of the issue is sufficient to reverse the core values of those involved and the core values being advanced by groups in...
I agree with and appreciate the broad point. I'll pick on one detail because I think it matters.
this whole parable of the drowning child, was set to crush down the selfish part of you, to make it look like you would be invalid and shameful and harmful-to-others if the selfish part of you won [...]
It is a parable calculated to set at odds two pieces of yourself... arranging for one of them to hammer down the other in a way that would leave it feeling small and injured and unable to speak in its own defense.
This seems uncharitable? Singer's thought ex...
I agree with parts of that. I'd also add the following (or I'd be curious why they're not important effects):
More broadly though, maybe we should be using more fine-grained concepts than "shorter timelines" and "slower takeoffs":