If it's worth saying, but not worth its own post (even in Discussion), then it goes here.
Notes for future OT posters:
1. Please add the 'open_thread' tag.
2. Check if there is an active Open Thread before posting a new one. (Immediately before; refresh the list-of-threads page before posting.)
3. Open Threads should be posted in Discussion, and not Main.
4. Open Threads should start on Monday, and end on Sunday.
In "The genie knows, but it doesn't care", RobbBB argues that even if an AI is intelligent enough to understand its creator's wishes in perfect detail, that doesn't mean its creator's wishes are the same as its own values. By analogy, even though humans were optimized by evolution to have as many descendants as possible, we can understand this without caring about it. Very smart humans may have lots of detailed knowledge of evolution & what it means to have many descendants, but then turn around and use condoms & other forms of birth control to stymie evolution's "wishes".
I thought of a potential way to get around this issue:
Create a tool AI.
Use the tool AI as a tool to improve itself, similar to the way I might use my new text editor to edit my new text editor's code.
Use the tool AI to build an incredibly rich world-model, which includes, among other things, an incredibly rich model of what it means to be Friendly.
Use the tool AI to build tools for browsing this incredibly rich world-model and getting explanations about what various items in the ontology correspond to.
Browse this incredibly rich world-model. Find the item in the ontology that corresponds to universal flourishing and tell the tool AI "convert yourself into an agent and work on this". (A rough code sketch of this workflow appears below.)
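To make the shape of this plan slightly more concrete, here is a minimal Python sketch. It is purely hypothetical: the ToolAI class, its methods, and the toy ontology are invented for illustration, and the only point is the interaction pattern (answer queries, let the human browse, act only when explicitly told to).

```python
# A purely hypothetical sketch of the tool-AI workflow above. The ToolAI
# class, its methods, and the toy "ontology" are made up for illustration;
# the point is only the shape of the interaction: answer queries, let a
# human browse, and act only on explicit instruction.

class ToolAI:
    """Answers queries; takes no actions unless explicitly told to."""

    def build_world_model(self):
        # Stand-in for an "incredibly rich" ontology: concept -> description.
        return {
            "universal_flourishing": "states of the world humans would endorse on reflection",
            "paperclip_maximization": "tiling the universe with paperclips",
        }

    def explain(self, concept):
        # Human-readable explanation of one item in the ontology.
        return self.build_world_model().get(concept, "unknown concept")

    def convert_to_agent(self, goal_concept):
        # The one irreversible step, gated behind an explicit human decision.
        print(f"(now an agent optimizing for: {goal_concept})")


def operator_session(tool):
    ontology = tool.build_world_model()
    for concept in ontology:                 # the human browses the world-model
        print(concept, "->", tool.explain(concept))
    chosen = "universal_flourishing"         # the human picks the target concept
    tool.convert_to_agent(chosen)            # "convert yourself into an agent"


operator_session(ToolAI())
```

The design choice the sketch tries to highlight is that convert_to_agent is the single irreversible step, and it only ever runs downstream of an explicit human decision.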
There's a lot hanging on the "tool AI/agent AI" distinction in this narrative. So before actually working on this plan, one would want to think hard about the meaning of this distinction. What if the tool AI inadvertently self-modifies & becomes "enough of an agent" to deceive its operator?
The tool vs agent distinction probably has something to do with (a) the degree to which the thing acts autonomously and (b) the degree to which its human operator stays in the loop. A vacuum is a tool: I'm not going to vacuum over my prized rug and rip it up. A Roomba is more of an agent: if I let it run while I am out of the house, it's possible that it will rip up my prized rug as it autonomously moves about the house. But if I stay home and glance over at my Roomba every so often, it's possible that I'll notice that my rug is about to get shredded and turn off my Roomba first. I could also be kept in the loop if the thing warns me about outcomes I might not want: for example, my Roomba could scan the house before running, giving me an inventory of all the items it might come in contact with.
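As an illustration of what "staying in the loop" could look like mechanically, here is a hypothetical propose/confirm/execute loop in Python. The Roomba-flavored plan and function names are made up; the point is just that any step flagged as risky requires explicit operator approval before it runs.

```python
# Hypothetical propose/confirm/execute loop illustrating "keeping the
# operator in the loop". Nothing flagged as risky runs without approval.

def plan_cleaning_route():
    # Pretend inventory of objects the device expects to come in contact with.
    return [
        {"action": "vacuum hallway", "at_risk": []},
        {"action": "vacuum living room", "at_risk": ["prized rug"]},
    ]

def execute(step):
    print("executing:", step["action"])

def run_with_operator_in_loop(plan):
    for step in plan:
        if step["at_risk"]:
            answer = input(f"{step['action']} may damage {step['at_risk']}. Proceed? [y/N] ")
            if answer.lower() != "y":
                print("skipping:", step["action"])
                continue
        execute(step)

run_with_operator_in_loop(plan_cleaning_route())
```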
An interesting proposition I'm tempted to argue for is the "autonomy orthogonality thesis". The original "orthogonality thesis" says that how intelligent an agent is and what values it has are, in principle, orthogonal. The autonomy orthogonality thesis says that how intelligent a system is and the degree to which it is autonomous (i.e. can be described as an "agent") are also, in principle, orthogonal. My pocket calculator is vastly more intelligent than I am at arithmetic, but it's still vastly less autonomous than I am. Google Search can instantly answer questions it would take me a lifetime to answer working independently, but Google Search is in no danger of "waking up" and displaying autonomy. So the question here is whether you could create something like Google Search that has the capacity for general intelligence while lacking autonomy.
I feel like the "autonomy orthogonality thesis" might be a good steelman of a lot of mainstream AI researchers who blow raspberries in the general direction of people concerned with AI safety. The thought is that if AI researchers have programmed something in detail to do one particular thing, it's not about to "wake up" and start acting autonomously.
Another thought: One might argue that if a tool AI starts modifying itself into a superintelligence, the result will be too complicated for humans to ever verify. But there's an interesting dilemma here. A key disagreement in the Hanson/Yudkowsky AI-foom debate was the existence of important, undiscovered chunky insights about intelligence. Either these insights exist or they don't. If they do, then the amount of code one needs to write in order to create a superintelligence is relatively small, and it should be possible for humans to independently verify the superintelligence's code. If they don't, then we are more likely to have a soft takeoff anyway, because intelligence is about building lots of heterogeneous structures and getting lots of little things right, and that takes time.
Another thought: maybe it's valuable to differentially advance natural language processing, so that AIs can better understand human concepts by reading about them?
Evolution doesn't have "wishes". It's not a teleological entity.