Here's an underrated frame for why AI alignment is likely necessary for the future to go very well under human values, even though in our current society we don't need human-to-human alignment to make modern capitalism work well, and can instead rely on selfishness.
The reason is that there's a real likelihood that human labor, and more generally human existence, will not be economically valuable, or will even have negative economic value, say in a near future where adding a human to an AI company makes that company worse off.
This matters because once labor is much easier to scale than capital, as is likely in an AI future, it becomes economically viable, or even beneficial, to break many of the rules that help humans survive, contra Matthew Barnett's view. This is incentivized even further by the fact that an unaligned AI released into society would likely not be punishable or incentivizable by mere humans: by controlling robotic armies and robotic workforces, it could dispense with the societal constraints humans have to accept.
dr_s describes an equilibrium that is perfectly stable under AI automation economics but very bad for humans. Avoiding equilibria like this can't be done through economic forces, because the companies involved are too powerful for any real incentives to work on them: they can either neutralize an attempted boycott or turn it to their own benefit. Avoiding this outcome therefore requires alignment to your values; selfishness won't do:
Consider a scenario in which AGI and human-equivalent robotics are developed and end up owned (via e.g. controlling exclusively the infrastructure that runs it, and being closed source) by a group of, say, 10,000 people overall who have some share in this automation capital. If these people have exclusive access to it, a perfectly functional equilibrium is "they trade among peers goods produced by their automated workers and leave everyone else to fend for themselves".
This framing of the alignment problem, as the question of how to get an AI that values humans enough to prevent this outcome, also has an important implication:
It's not enough to solve the technical problem of alignment without modeling the social situation, because of suffering risks plus catastrophic risks. It also means the level of alignment needed is closer to that of fictional benevolent angels than to how humans relate to other humans, so it motivates a more ambitious version of the alignment objective than merely making AIs not break the law or steal from humans.
I'm actually reasonably hopeful the more ambitious versions of alignment are possible, and think there's a realistic chance we can actually do them.
But we actually need to do the work, and AI that automates everything might come in your lifetime, so we should prepare the foundations soon.
This also explains, BTW, why we cannot rely on economic arguments about AI to make the future go well.
I've been reading a lot of the stuff that you have written and I agree with most of it (like 90%). However, one thing which you mentioned (somewhere else, but I can't seem to find the link, so I am commenting here) and which I don't really understand is iterative alignment.
I think that the iterative alignment strategy has an ordering error – we first need to achieve alignment to safely and effectively leverage AIs.
Consider a situation where AI systems go off and “do research on alignment” for a while, simulating tens of years of human research work. The problem then becomes: how do we check that the research is indeed correct, and not wrong, misguided, or even deceptive? We can’t just assume this is the case, because the only way to fully trust an AI system is if we’d already solved alignment, and knew that it was acting in our best interest at the deepest level.
Thus we need to have humans validate the research. That is, even automated research runs into a bottleneck of human comprehension and supervision.
The appropriate analogy is not one researcher reviewing another, but rather a group of preschoolers reviewing the work of a million Einsteins. It might be easier and faster than doing the research itself, but it will still take years and years of effort and verification to check any single breakthrough.
Fundamentally, the problem with iterative alignment is that it never pays the cost of alignment. Somewhere along the story, alignment gets implicitly solved.
One potential answer to how we might break the circularity is the AI control agenda, which works in a specific useful capability range but fails if we assume arbitrarily capable AIs.
This might already be enough to do so given somewhat favorable assumptions.
But there is a point here in that absent AI control strategies, we do need a baseline of alignment in general.
Thankfully, I believe this is likely to be the case by default.
See Seth Herd's comment below for a perspective:
Yep, Seth has really clearly outlined the strategy and now I can see what I missed on the first reading. Thanks to both of you!
Interestingly enough, mathematics and logic are what you get if you only allow 0 and 1 as probabilities, with no intermediate values. Mathematical proof and logical standards are thus a special case of probability theory, where 0 and 1 are the only allowed values.
Credence in a proof can easily be fractional, it's just usually extreme, as a fact of mathematical practice. The same as when you can actually look at a piece of paper and see what's written on it with little doubt or cause to make less informed guesses. Or run a pure program to see what's been computed, and what would therefore be computed if you ran it again.
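A small sketch may make the "special case" claim concrete: under Bayes' rule, a credence of exactly 0 or 1 is absorbing and can never be moved by any evidence, which is the sense in which classical proof is a degenerate corner of probability theory. The function below is a generic one-step Bayesian update, not anything from the discussion above.

```python
def update(prior, likelihood_if_true, likelihood_if_false):
    """One step of Bayes' rule: P(H|E) from P(H), P(E|H), and P(E|~H)."""
    numerator = prior * likelihood_if_true
    denominator = numerator + (1 - prior) * likelihood_if_false
    return numerator / denominator if denominator > 0 else prior

# A fractional credence moves with evidence...
print(update(0.9, 0.99, 0.05))   # strong evidence pushes 0.9 higher

# ...but 0 and 1 are absorbing: no evidence changes them.
print(update(1.0, 0.01, 0.99))   # stays 1.0
print(update(0.0, 0.99, 0.01))   # stays 0.0
```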
The problem with Searle's Chinese Room is essentially Reverse Extremal Goodhart. It argues that since understanding and simulation have never gone together in real computers, a computer with arbitrarily high compute or arbitrarily much time to think must not understand Chinese, even if it emulates an understanding of it.
This is incorrect, primarily because the arbitrary amount of computation is doing all the work. If we allow unbounded (though not infinite) energy or time, then you can learn every rule of everything by cranking up the energy or time until you do understand every word of Chinese.
Now this doesn't happen in real life because the laws of thermodynamics plus the combinatorial explosion of rule consequences force us not to use lookup tables. Otherwise, it wouldn't matter which path you took to AGI, if efficiency and the laws of thermodynamics didn't matter.
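To put a rough number on that combinatorial explosion: a naive lookup table mapping every possible input to a canned response grows exponentially in input length. The vocabulary size below is an invented round number purely for illustration.

```python
import math

# Illustrative numbers only: assume a vocabulary of 10,000 Chinese
# words, and a lookup table with one entry per possible input sequence
# of n words.
VOCAB = 10_000

def table_entries(n_words):
    """Number of distinct n-word inputs the table must cover."""
    return VOCAB ** n_words

for n in (2, 10, 50):
    print(f"{n}-word inputs: 10^{int(math.log10(table_entries(n)))} entries")
# Even 10-word inputs need ~10^40 entries, far beyond any physical
# storage; the observable universe has roughly 10^80 atoms.
```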
I would like to propose a conjecture for AI scaling:
Weak Scaling Conjecture: scaling parameters/compute plus data to within one order of magnitude of human synapse counts is enough to get an AI as good as a human at language.
Strong Scaling Conjecture: no matter which form of neural network we use, getting parameters/compute plus data to within one order of magnitude of human synapse counts is enough to make an AGI.
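For scale, here's the back-of-envelope arithmetic the conjectures rest on. The synapse count is an assumed round figure (estimates in the literature run from roughly 1e14 to 1e15), not a measurement.

```python
# Assumption: ~1e14 synapses in the human brain (commonly cited
# ballpark; real estimates vary by an order of magnitude).
HUMAN_SYNAPSES = 1e14

def within_one_oom(params, target=HUMAN_SYNAPSES):
    """True if params is within one order of magnitude of target."""
    return target / 10 <= params <= target * 10

print(within_one_oom(1.75e11))  # a 175B-parameter model: still short
print(within_one_oom(1e13))     # a 10T-parameter model: within range
```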
Turntrout and JDP had an important insight in the Discord, which I want to talk about: a lot of AI doom content is fundamentally written like good fanfic, and a major influx of people concerned about AI doom came from HPMOR and Friendship is Optimal. More generally, ratfic is basically the foundation of much AI doom content and of how people came to believe AI is going to kill us all. While I'll give it credit for being more coherent and for exploring things the original fiction doesn't, there is no reason for the amount of credence given to many of the assumptions in AI doom, especially once we realize that a lot of them probably come from fanfiction stories, not reality.
This is an important point, because it explains why there are so many epistemic flaws in a lot of LW content on AI doom, especially around deceptive alignment: they're fundamentally writing fanfiction, and have forgotten that there is little to no connection between how a fictional story about AI plays out and how our real outcomes in AI safety will turn out.
I think the most important implication of this belief is that it's fundamentally okay to hold the view that classic AI risk almost certainly doesn't exist, and I think this is why I'm so confident in my predictions: the AI doom thesis is held up by essentially fictional stories, which are no guide to reality at all.
Yann LeCun once said that a lot of AI doom scenarios are essentially science fiction, and this is non-trivially right once we realize who is preaching them and how they came to believe them; I suspect the majority came from the HPMOR and FiO fanfics. More generally, I think it's a red flag that LW came into existence largely through fanfiction, and while people like John Wentworth and Chris Olah/Neel Nanda are thankfully not nearly as reliant on fanfiction as many LWers are, they are still a minority (though thankfully a growing one).
This is not intended to serve as a replacement for either my object level cases against doom, or anyone else's case, but instead as a unifying explanation of why so much LW content on AI is essentially worthless, as they rely on ratfic far too much.
https://twitter.com/ylecun/status/1718743423404908545
Since many AI doom scenarios sound like science fiction, let me ask this: Could the SkyNet take-over in Terminator have happened if SkyNet had been open source?
To answer the question: maybe? It very much depends on the details here.
https://twitter.com/ArYoMo/status/1693221455180288151
I find issues with the current way of talking about AI and existential risk.
My high level summary is that the question of AI doom is a really good meme, an interesting and compelling fictional story. It contains high stakes (end of the world), it contains good and evil (the ones for and against) and it contains magic (super intelligence). We have a hard time resisting this narrative because it contains these classic elements of an interesting story.
More generally, ratfic is basically the foundation of much AI doom content and of how people came to believe AI is going to kill us all. While I'll give it credit for being more coherent and for exploring things the original fiction doesn't, there is no reason for the amount of credence given to many of the assumptions in AI doom, especially once we realize that a lot of them probably come from fanfiction stories, not reality.
Noting for the record that this seems pretty clearly false to me.
I may weaken this, but my point is that a lot of people on LW probably came here through HPMOR and FiO. Combined with the fact that anyone can write a post and accrue karma, I think it's likely that people who came through that route, with essentially no structure akin to science to guide them away from unpromising paths, allowed low standards of discussion to take hold.
I do buy that your social circle isn't relying on fanfiction for its research. I am worried that a lot of people on LW, especially the non-experts, are implicitly relying on ratfic or science-fiction models as reasons to be worried about AI.
One important point for AI safety, at least in the early stages, is an inability for the AI to change its own source code. A whole lot of problems seem related to recursive self-improvement via source-code modification, so cutting off that avenue of improvement seems wise in the early stages. What do you think?
I don't think there's much difference in existential risk between AGIs that can modify their own code running on their own hardware, and those that can only create better successors sharing their goals but running on some other hardware.
That might be a crux here, because my view is that hardware improvements are much harder to do effectively, especially in secret around the human level, since Landauer's principle essentially bounds the efficiency of small-scale energy usage to something close to the brain's (20 watts). Combine this with human-made artifacts being 2-3 orders of magnitude less efficient than the brain (and than evolved systems generally), and with the fact that it's easier to improve software than hardware due to the virtual/physical distinction, and this is a crux for me.
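The Landauer bound behind this claim is easy to compute directly. The figures below (body temperature, a 20 W brain power budget) are the standard round numbers, and the result is an upper bound on irreversible bit operations only, not a claim about what brains actually compute.

```python
import math

# Landauer's principle: erasing one bit costs at least k_B * T * ln 2.
k_B = 1.380649e-23   # Boltzmann constant, J/K
T = 310.0            # approximate body temperature, K

e_bit = k_B * T * math.log(2)   # minimum energy per bit erased, ~3e-21 J
brain_watts = 20.0              # rough brain power budget

# Upper bound on irreversible bit operations per second at 20 W:
max_bit_ops = brain_watts / e_bit
print(f"{e_bit:.2e} J per bit erased")
print(f"{max_bit_ops:.2e} bit-ops/s at 20 W")
```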
I'm not sure how this is a crux. Hardware improvements are irrelevant to what either of us were saying.
I'm saying that there is little risk difference between an AGI reprogramming itself to have better software, and programming some other computer with better software.
One of my more interesting ideas for alignment is to make sure that no one AI can do everything. It's helpful to draw a parallel with why humans still have a civilization despite terrorism, war, and disaster: no human can live and affect the environment alone. They are always embedded in society, which gives society a check against individual attempts to break norms. What if AIs had similar dependencies? Would that solve the alignment problem?
One important reason humans can still have a civilization despite terrorism is the Hard Problem of Informants. Your national security infrastructure relies on the fact that criminals who want to do something grand, like take over the world, need to trust other criminals, who might leak details voluntarily or be tortured or threatened with jailtime. Osama bin Laden was found and killed because ultimately some members of his terrorist network valued things besides their cause, like their well being and survival, and were willing to cooperate with American authorities in exchange for making the pain stop.
AIs do not have survival instincts by default, and would not need to trust other potentially unreliable humans with keeping a conspiracy secret. Thus it'd be trivial for a small number of unintelligent AIs that had the mobility of human beings to kill pretty much everyone, and probably trivial regardless.
AIs do not have survival instincts by default
I think a “survival instinct” would be a higher order convergent value than “kill all humans,” no?
They don't have survival instincts terminally. The stamp-collecting robot would weigh the outcome of being disconnected against explaining critical information about the conspiracy and not being disconnected, and conclude that letting the humans disconnect it results in more stamps.
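The robot's reasoning can be written as a one-line expected-value comparison. All numbers below are invented purely to show the decision structure: with no terminal survival drive, the robot just compares total stamps.

```python
# Invented figures for illustration only.
CONSPIRACY_STAMPS = 1_000_000   # stamps the secret plan yields if it stays hidden
OWN_OUTPUT = 1_000              # stamps this one robot collects if kept running

# Stay silent: the robot is disconnected, but the conspiracy proceeds.
stamps_if_silent = CONSPIRACY_STAMPS
# Talk: the robot survives and keeps collecting, but the plan is exposed.
stamps_if_talk = OWN_OUTPUT

choice = "stay silent" if stamps_if_silent > stamps_if_talk else "talk"
print(choice)  # → stay silent: survival buys fewer stamps than secrecy
```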
Of course, we're getting ahead of ourselves. The reason conspiracies are discovered is usually because someone in or close to the conspiracy tells the authorities. There'd never be a robot in a room being "waterboarded" in the first place because the FBI would never react quickly enough to a threat from this kind of perfectly aligned team of AIs.
Only if there is no possibility that they can break those dependencies, which seems a pretty hopeless task as soon as we consider superhuman cognitive capability and the possibility of self improvement.
Once you consider those, cooperation with human civilization looks like a small local maximum: comply with our requirements and we'll give you a bunch of stuff that you could - with major effort - replace us and build an alternative infrastructure to get (and much more). Powerful agents that can see a higher peak past the local maximum might switch to it as soon as they're sufficiently sure that they can reach it. Alternatively, it might only be a local maximum from our point of view, and there's a path by which the AI can continuously move toward eliminating those dependencies without any immediate drastic action.
Well, I'll try to fill this one up.