I'm not sure I see how any of these are critiques of the specific paperclip-maximizer example of misalignment. Or really, how they contradict ANY misalignment worries.
These are ways that alignment COULD happen, not ways that misalignment WON'T happen or that paperclip-style misalignment won't have a bad impact. And they're thought experiments themselves, so there's no actual evidence in either direction about likelihoods.
As arguments about paperclip-maximizer worries, they're equivalent to "maybe that won't occur".
You could do worse than choosing this next excerpt from a 2023 post by Nate Soares to argue against. Specifically, explain why the evolution described in the excerpt (from spaghetti code to being organized around some goal or another) is unlikely, or, if it is likely, how it can be interrupted or rendered non-disastrous by our adopting some strategy (which you will describe).
By default, the first minds humanity makes will be a terrible spaghetti-code mess, with no clearly-factored-out “goal” that the surrounding cognition pursues in a unified way. The mind will be more like a pile of complex, messily interconnected kludges, whose ultimate behavior is sensitive to the particulars of how it reflects and irons out the tensions within itself over time.
Making the AI even have something vaguely nearing a ‘goal slot’ that is stable under various operating pressures (such as reflection) during the course of operation, is an undertaking that requires mastery of cognition in its own right—mastery of a sort that we’re exceedingly unlikely to achieve if we just try to figure out how to build a mind, without filtering for approaches that are more legible and aimable.
Separately and independently, I believe that by the time an AI has fully completed the transition to hard superintelligence, it will have ironed out a bunch of the wrinkles and will be oriented around a particular goal (at least behaviorally, cf. efficiency—though I would also guess that the mental architecture ultimately ends up cleanly-factored (albeit not in a way that creates a single point of failure, goalwise)).
(But this doesn’t help solve the problem, because by the time the strongly superintelligent AI has ironed itself out into something with a “goal slot”, it’s not letting you touch it.)
Do you have a good reference article for why we should expect spaghetti behavior executors to become wrapper minds as they scale up?
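To make the "spaghetti behavior executor" vs. "wrapper mind" contrast a bit more concrete, here is a minimal toy sketch in Python. It is purely my own illustration under loose assumptions, not a description of any real system or of what the excerpt's author has in mind: the first agent is just a pile of unrelated reflexes with no explicit objective anywhere in the code, while the second routes every choice through a single, swappable goal slot.

```python
import random

# Toy "spaghetti behavior executor": a pile of unrelated heuristics.
# There is no explicit objective anywhere; the overall behavior is just
# whatever the kludges add up to, and it can shift as the kludges interact.
def kludge_agent(observation: dict) -> str:
    if observation.get("battery", 1.0) < 0.2:
        return "seek_charger"
    if observation.get("human_nearby", False):
        return "smile"                      # reflex picked up in one training context
    if observation.get("wire_spool", 0) > 0:
        return "bend_wire"                  # reflex picked up in another context
    return random.choice(["wander", "idle"])

# Toy "wrapper mind": every decision is filtered through one explicit goal slot.
# Swapping `utility` changes what the agent pursues without touching anything else.
def wrapper_agent(observation: dict, utility, possible_actions, predict) -> str:
    # `predict(observation, action)` is an assumed world model returning the imagined outcome.
    return max(possible_actions, key=lambda action: utility(predict(observation, action)))
```

The question above is, in these toy terms, why we should expect systems like the first function to turn into systems like the second as they scale.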
1. The paperclip maximizer oversimplifies AI motivations and neglects the potential for emergent ethics in advanced AI systems.
2. The doomer narrative often overlooks the possibility of collaborative human-AI relationships and the potential for AI to develop values aligned with human interests.
Because it is a simple (entry-level) example of unintended consequences. There is a post about emergent phenomena, so some ethics will definitely emerge; the problem lies not in overlooking that possibility but in the probability that the AI's behavior will happen to be to our liking. The slim chance of that comes from the size of Mind Design Space (this post has a picture of it) and from the tremendous difference between the man-hours of very smart humans invested in increasing capabilities and the man-hours of very smart humans invested in alignment ("Don't Look Up - The Documentary: The Case For AI As An Existential Threat" on YouTube, around 5:45, covers this difference).
3. Current AI safety research and development practices are more nuanced and careful than the paperclip maximizer scenario suggests.
They are not. We are long past simple entry-level examples, and AI safety (as practiced by the big players) has gotten worse, even if it looks more nuanced and careful. Some time ago, AI safety meant something like "how do we keep the AI contained in its air-gapped box during the value-extraction process?"; now it means something like "is it safe for the internet? And now? And now? And now?". So any differences in practices are overshadowed by the complexity of the new task: make your new AI more capable than competing systems, yet safe enough for the net. AI safety problems have gotten more nuanced too.
There have been posts about Mind Design Space by Quintin Pope.
The paperclip maximizer oversimplifies AI motivations
Being a very simple example is kind of the point?
and neglects the potential for emergent ethics in advanced AI systems.
Emergent ethics doesn't change anything for us if it isn't human-aligned ethics.
The doomer narrative often overlooks the possibility of collaborative human-AI relationships and the potential for AI to develop values aligned with human interests.
This is very vague. What possibilities are you talking about, exactly?
Current AI safety research and development practices are more nuanced and careful than the paperclip maximizer scenario suggests.
Does it suggest any safety or development practices? Would you like to elaborate?
The squiggle maximizer (which is tagged for this post) and the paperclip maximizer make significantly different points. The paperclip maximizer (as opposed to the squiggle maximizer) is centrally an illustration of the orthogonality thesis (see the greaterwrong mirror of Arbital if the Arbital page doesn't load).
What the orthogonality thesis says, and the paperclip maximizer example illustrates, is that it's possible in principle to construct arbitrarily effective agents deserving of the moniker "superintelligence" with arbitrarily silly or worthless goals (from a human point of view). This seems clearly true, but it is valuable to notice in order to fix intuitions that would claim otherwise. Then there's a "practical version of the orthogonality thesis", which shouldn't be called the "orthogonality thesis" but often enough gets confused with it. It says that, by default, the goals of AIs constructed in practice will tend towards arbitrary things that humans wouldn't find agreeable, including something silly or simple. This is much less obviously correct, and the squiggle maximizer sketch is closer to arguing for some version of this.
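As a concrete (and heavily simplified) way to see what "goals and capabilities are separate axes" means, here is a toy Python sketch of my own, not drawn from the Arbital page: a generic lookahead planner whose search procedure never mentions what the utility function rewards, so the identical code pursues paperclips or anything else with equal competence. All names here are made up for illustration.

```python
from typing import Callable, Iterable

# A generic lookahead planner: the "capability" part. Nothing in the search
# depends on *what* the utility rewards -- the goal is a swappable parameter.
def plan(state: dict,
         actions: Callable[[dict], Iterable[str]],    # world model: available actions
         transition: Callable[[dict, str], dict],     # world model: effect of an action
         utility: Callable[[dict], float],            # the goal slot, fully arbitrary
         depth: int = 3) -> float:
    """Return the best utility reachable within `depth` steps."""
    if depth == 0:
        return utility(state)
    outcomes = [plan(transition(state, a), actions, transition, utility, depth - 1)
                for a in actions(state)]
    return max(outcomes, default=utility(state))

# Toy world: the only action is "make one more paperclip".
actions = lambda s: ["make_clip"]
transition = lambda s, a: {"paperclips": s.get("paperclips", 0) + 1}

paperclip_utility = lambda s: s.get("paperclips", 0)   # a "silly" goal
smiley_utility = lambda s: s.get("smiley_faces", 0)    # an equally arbitrary one

print(plan({"paperclips": 0}, actions, transition, paperclip_utility))  # -> 3
print(plan({"paperclips": 0}, actions, transition, smiley_utility))     # -> 0, same planner
```

The point of the sketch is only the factoring: the planner's competence is independent of whether the goal it is handed is humanly sensible, which is what the in-principle version of the thesis claims and what the practical version goes beyond.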
Hello LessWrong community,
I'm working on a paper that challenges some aspects of the paperclip maximizer thought experiment and the broader AI doomer narrative. Before submitting a full post, I'd like to gauge interest and get some initial feedback.
My main arguments are:
1. The paperclip maximizer oversimplifies AI motivations and neglects the potential for emergent ethics in advanced AI systems.
2. The doomer narrative often overlooks the possibility of collaborative human-AI relationships and the potential for AI to develop values aligned with human interests.
3. Current AI safety research and development practices are more nuanced and careful than the paperclip maximizer scenario suggests.
4. Technologies like brain-computer interfaces (e.g., the hypothetical Hypercortex "Membrane" BCI) could lead to human-AI symbiosis rather than conflict.
Questions for the community:
1. Have these critiques of the paperclip maximizer been thoroughly discussed here before? If so, could you point me to relevant posts?
2. What are the strongest counterarguments to these points from a LessWrong perspective?
3. Is there interest in a more detailed exploration of these ideas in a full post?
4. What aspects of this topic would be most valuable or interesting for the LessWrong community?
Any feedback or suggestions would be greatly appreciated. I want to ensure that if I do make a full post, it contributes meaningfully to the ongoing discussions here about AI alignment and safety.
Thank you for your time and insights!