My biggest counterargument to the case that AI progress should be slowed down comes from an observation made by porby about a fundamental lack of a property we theorize about AI systems, and the one foundational assumption around AI risk:
Instrumental convergence, and it's corollaries like powerseeking.
The important point is that current and most plausible future AI systems don't have incentives to learn instrumental goals, and the type of AI that has enough space and has very few constraints, like RL with sufficiently unconstrained action spaces to learn instrumental goals is essentially useless for capabilities today, and the strongest RL agents use non-instrumental world models.
Thus, instrumental convergence for AI systems is fundamentally wrong, and given that this is the foundational assumption of why superhuman AI systems pose any risk that we couldn't handle, a lot of other arguments for why we might to slow down AI, why the alignment problem is hard, and a lot of other discussion in the AI governance and technical safety spaces, especially on LW become unsound, because they're reasoning from an uncertain foundation, and at worst are reasoning from a false premise to reach many false conclusions, like the argument that we should reduce AI progress.
Fundamentally, instrumental convergence being wrong would demand pretty vast changes to how we approach the AI topic, from alignment to safety and much more to come,
To be clear, the fact that I could only find a flaw within AI risk arguments because they were founded on false premises is actually better than many other failure modes, because it at least shows fundamentally strong locally valid reasoning on LW, rather than motivated reasoning or other biases that transforms true statements into false statements.
One particular case of the insight is that OpenAI and Anthropic were fundamentally right in their AI alignment plans, because they have managed to avoid instrumental convergence from being incentivized, and in particular LLMs can be extremely capable without being arbitrarily capable or having instrumental world models given resources.
I learned about the observation from this post below:
https://www.lesswrong.com/posts/EBKJq2gkhvdMg5nTQ/instrumentality-makes-agents-agenty
Porby talks about why AI isn't incentivized to learn instrumental goals, but given how much this assumption gets used in AI discourse, sometimes implicitly, I think it's of great importance that instrumental convergence is likely wrong.
I have other disagreements, but this is my deepest disagreement with your model (and other models around AI is especially dangerous).
EDIT: A new post on instrumental convergence came out, and it showed that many of the inferences made weren't just unsound, but invalid, and in particular Nick Bostrom's Superintelligence was wildly invalid in applying instrumental convergence to strong conclusions on AI risk.
Yeah, this theory definitely needs far better methodologies for testing this theory, and while I wouldn't be surprised by at least part of the answer/solution to the Hard Problem or problems of Consciousness being that we have unnecessarily conflated various properties that occur in various humans in the word consciousness because of political/moral reasons, whereas AIs don't automatically have all the properties of humans here, so we should create new concepts for AIs.
But yes, this post at the very least relies on a theory that hasn't been tested, and while I suspect it's at least partially correct, the evidence in the conflationary alliances post is basically 0 evidence for the proposition.
Nor do we have the ability to bend probabilities arbitrarily for arbitrary statements, which was a core power in Gurren Lagann movies, if I recall correctly.
This part IMO is a crux, in that I don't truly believe an objective measure/magical reality fluid can exist in the multiverse, if we allow the concept to be sufficiently general, ruining both probability and expected value/utility theory in the process.
Heck, in the most general cases, I don't believe any coherent measure exists at all, which basically ruins probability and expected utility theory at the same time.
Maybe we have some deeper disagreement here. It feels plausible to me that there is a measure of "realness" in the Multiverse that is an objective fact about the world, and we might be able to figure it out.
The most important thing to realize about AI safety is that basically all versions of practically safe AI must make certain assumptions that no one does a specific action (mostly related to misuse reasons, but for some specific plans, can also be related to misalignment reasons).
Another way to say it is that I believe that in practice, these two categories are the same category, such that basically all work that's useful in the field will require someone not to do something, so the costs of sharing are practically 0, and the expected value of sharing insights is likely very large.
Specifically, I'm asserting that these 2 categories are actually one category for most purposes:
Actually make AI safe and another, sadder but easier field of "Make AI safe as long as no one does the thing."
My most plausible explanation of your position is that you think this is aligned, IE, you think providing plausible URLs is the best it can do when it doesn't know. I disagree: it can report that it doesn't know, rather than hallucinate answers. Actively encouraging itself to give plausible-but-inaccurate links is misalignment. It's not just missing a capability; it's also actively self-prompting to deceive the user in a case where it seems capable of knowing better (if it were deliberating with the user's best interest in mind.
I agree that this is actually a small sign of misalignment, and o1 should probably fail more visibly by admitting it's ignorance rather than making stuff up, so at this point, I've come to agree that o1's training probably induced at least a small amount of misalignment, which is bad news.
Also, thanks for passing my ITT here.
Capabilities issues and alignment issues are not mutually exclusive. It appears that capabilities issues (not being able to remember all valid URLs) is causing an alignment issue in this case (actively planning to give deceptive URLs rather than admit its ignorance). This causes me to predict that even if this particular capability issue is resolved, the training will still have this overall tendency to turn remaining capability issues into alignment issues. How are you seeing this differently?
This is a somewhat plausible problem, and I suspect the general class of solutions to these problems will probably require something along the lines of making the AI fail visibly rather than invisibly.
The basic reason I wanted to distinguish them is because the implications of a capabilities issue mean that the AI labs are incentivized to solve it greatly, and while alignment incentives are real and do exist, they can be less powerful than the general public wants.
What do you mean? I don't get what you are saying is convincing.
I'm specifically referring to this answer, combined with a comment that convinced me that the o1 deception so far is plausibly just a capabilities issue:
https://www.lesswrong.com/posts/3Auq76LFtBA4Jp5M8/why-is-o1-so-deceptive#L5WsfcTa59FHje5hu
https://www.lesswrong.com/posts/3Auq76LFtBA4Jp5M8/why-is-o1-so-deceptive#xzcKArvsCxfJY2Fyi
Is this damning in the sense of providing significant evidence that the technology behind o1 is dangerous? That is: does it provide reason to condemn scaling up the methodology behind o1? Does it give us significant reason to think that scaled-up o1 would create significant danger to public safety? This is trickier, but I say yes. The deceptive scheming could become much more capable as this technique is scaled up. I don't think we have a strong enough understanding of why it was deceptive in the cases observed to rule out the development of more dangerous kinds of deception for similar reasons.
I think this is the crux.
To be clear, I am not saying that o1 rules out the ability of more capable models to deceive naturally, but I think 1 thing blunts the blow a lot here:
So for now, what I suspect is that o1's safety when scaled up mostly remains unknown and untested (but this is still a bit of bad news).
Is this damning in the sense that it shows OpenAI is dismissive of evidence of deception?
- However, I don't buy the distinction they draw in the o1 report about not finding instances of "purposefully trying to deceive the user for reasons other than satisfying the user request". Providing fake URLs does not serve the purpose of satisfying the user request. We could argue all day about what it was "trying" to do, and whether it "understands" that fake URLs don't satisfy the user. However, I maintain that it seems at least very plausible that o1 intelligently pursues a goal other than satisfying the user request; plausibly, "provide an answer that shallowly appears to satisfy the user request, even if you know the answer to be wrong" (probably due to the RL incentive).
I think the distinction is made to avoid confusing capability and alignment failures here.
I agree that it doesn't satisfy the user's request.
- More importantly, OpenAI's overall behavior does not show concern about this deceptive behavior. It seems like they are judging deception case-by-case, rather than treating it as something to steer hard against in aggregate. This seems bad.
Yeah, this is the biggest issue of OpenAI for me, in that they aren't trying to steer too hard against deception.
The problem with that plan is that there are too many valid moral realities, so which one you do get is once again a consequence of alignment efforts.
To be clear, I'm not stating that it's hard to get the AI to value what we value, but it's not so brain-dead easy that we can make the AI find moral reality and then all will be well.
Not always, but I'd say often.
I'd also say that at least some of the justification for changing values in philosophers/humans is because they believe the new values are closer to the moral reality/truth, which is an instrumental incentive.
To be clear, I'm not going to state confidently that this will happen (maybe something like instruction following ala @Seth Herd is used instead, such that the pointer is to the human giving the instructions, rather than having values instead), but this is at least reasonably plausible IMO.
Alright, now that I've read this post, I'll try to respond to what I think you got wrong, and importantly illustrate some general principles.
To respond to this first:
I think this is actually wrong, because of synthetic data letting us control what the AI learns and what they value, and in particular we can place honeypots that are practically indistingushiable from the real world, such that if we detected an AI trying to deceive or gain power, the AI almost certainly doesn't know whether we tested it or whether it's in the the real world:
It's the same reason for why we can't break out of the simulation IRL, except we don't have to face adversarial cognition, so the AI's task is even harder than our task.
See also this link:
https://www.beren.io/2024-05-11-Alignment-in-the-Age-of-Synthetic-Data/
For this:
I think this is wrong, and a lot of why I disagree with the pivotal act framing is probably due to disagreeing with the assumption that future technology will be radically biased towards to offense, and while I do think biotechnology is probably pretty offense-biased today, I also think it's tractable to reduce bio-risk without trying for pivotal acts.
Also, I think @evhub's point about homogeneity of AI takeoff bears on this here, and while I don't agree with all the implications, like there being no warning shot for deceptive alignment (because of synthetic data), I think there's a point in which a lot of AIs are very likely to be very homogenous, and thus break your point here:
https://www.lesswrong.com/posts/mKBfa8v4S9pNKSyKK/homogeneity-vs-heterogeneity-in-ai-takeoff-scenarios
I think that AGIs are more robust to things going wrong than nuclear cores, and more generally I think there is much better evidence for AI robustness than fragility.
@jdp's comment provides more evidence on why this is the case:
Link here:
https://www.lesswrong.com/posts/JcLhYQQADzTsAEaXd/?commentId=7iBb7aF4ctfjLH6AC
I think that there will be generalization of alignment, and more generally I think that alignment generalizes further than capabilities by default, contra you and Nate Soares because of these reasons:
See also this link for more, but I think that's the gist for why I expect AI alignment to generalize much further than AI capabilities. I'd further add that I think evolutionary psychology got this very wrong, and predicted much more complex and fragile values in humans than is actually the case:
https://www.beren.io/2024-05-15-Alignment-Likely-Generalizes-Further-Than-Capabilities/
This is covered by my points on why alignment generalizes further than capabilities and why we don't need pivotal acts and why we actually have safe testing grounds for deceptive AI.
Re the sharp capability gain breaking alignment properties, one very crucial advantage we have over evolution is that our goals are much more densely defined, constraining the AI more than evolution, where very, very sparse reward was the norm, and critically sparse-reward RL does not work for capabilities right now, and there are reasons to think it will be way less tractable than RL where rewards are more densely specified.
Another advantage we have over evolution, and chimpanzees/gorillas/orangutans is far, far more control over their data sources, which strongly influences their goals.
This is also helpful to point towards more explanation of what the differences are between dense and sparse RL rewards:
Yeah, I covered this above, but evolution's loss function was neither that simple, compared to human goals, and it was ridiculously inexact compared to our attempts to optimize AIs loss functions, for the reasons I gave above.
I've answered that concern above in synthetic data for why we have the ability to get particular inner behaviors into a system.
The points were covered above, but synthetic data early in training + densely defined reward/utility functions = alignment, because they don't know how to fool humans when they get data corresponding to values yet.
The key is that data on values is what constrains the choice of utility functions, and while values aren't in physics, they are in human books, and I've explained why alignment generalizes further than capabilities.
I think that there is actually a simple core of alignment to human values, and a lot of the reasons for why I believe this is because I believe about 80-90%, if not more of our values is broadly shaped by the data, and not the prior, and that the same algorithms that power our capabilities is also used to influence our values, though the data matters much more than the algorithm for what values you have.
More generally, I've become convinced that evopsych was mostly wrong about how humans form values, and how they get their capabilities in ways that are very alignment relevant.
I also disbelieve the claim that humans had a special algorithm that other species don't have, and broadly think human success was due to more compute, data and cultural evolution.
Alright, while I think your formalizations of corrigibility failed to get any results, I do think there's a property close to corrigibility that is likely to be compatible with consequentialist reasoning, and that's instruction following, and there are reasons to think that instruction following and consequentialist reasoning go together:
https://www.lesswrong.com/posts/7NvKrqoQgJkZJmcuD/instruction-following-agi-is-easier-and-more-likely-than
https://www.lesswrong.com/posts/ZdBmKvxBKJH2PBg9W/corrigibility-or-dwim-is-an-attractive-primary-goal-for-agi
https://www.lesswrong.com/posts/k48vB92mjE9Z28C3s/implied-utilities-of-simulators-are-broad-dense-and-shallow
https://www.lesswrong.com/posts/EBKJq2gkhvdMg5nTQ/instrumentality-makes-agents-agenty
https://www.lesswrong.com/posts/vs49tuFuaMEd4iskA/one-path-to-coherence-conditionalization
I'm very skeptical that a CEV exists for the reasons @Steven Byrnes addresses in the Valence sequence here:
https://www.lesswrong.com/posts/SqgRtCwueovvwxpDQ/valence-series-2-valence-and-normativity#2_7_Moral_reasoning
But it is also unnecessary for value learning, because of the data on human values and alignment generalizing farther than capabilities.
I addressed why we don't need a first try above.
For the point on corrigibility, I disagree that it's like training it to say that as a special case 222 + 222 = 555, for 2 reasons:
I disagree with this, but I do think that mechanistic interpretability does have lots of work to do.
The key disagreement is I believe we don't need to check all the possibilities, and that even for smarter AIs, we can almost certainly still verify their work, and generally believe verification is way, way easier than generation.
I basically disagree with this, both in the assumption that language is very weak, and importantly I believe no AGI-complete problems are left, for the following reasons quoted from Near-mode thinking on AI:
https://www.lesswrong.com/posts/ASLHfy92vCwduvBRZ/near-mode-thinking-on-ai
To address an epistemic point:
You cannot actually do this and hope to get any quality of reasoning, for the same reason that you can't update on nothing/no evidence.
The data matters way more than you think, and there's no algorithm that can figure out stuff with 0 data, and Eric Drexler didn't figure out nanotechnology using the null string as input.
This should have been a much larger red flag for problems, but people somehow didn't realize how wrong this claim was.
And that's the end of my very long comment on the problems with this post.