We have to get alignment right on the first try.
If the first superhuman intelligence kills us, it won't matter that we could hypothetically have gotten it right on the fifth attempt.
The alignment problem is incredibly difficult.
Feel free to provide Python code that correctly describes what humans consider "good". For any high-level concept, such as "humans are happy", propose a way to measure this "happiness" concretely -- is it that the corners of their mouths point upwards, or that the dopamine levels in their bloodstreams are above some threshold, or that they self-report as happy in a questionnaire? All of the above together?
Then notice how, e.g., your metric of "happiness" can score quite high without humans actually being happy. You can cut their mouths so that the corners permanently point upwards, you can keep injecting them with dopamine, you can put a gun to their heads and say "write 'I feel blissful' on this paper", or simply take the paper and fill in the answers yourself (you could electrically stimulate the muscles in their hands, so that it is technically their hands that write the answers on the paper).
Why would the AI choose this fake happiness over the real one? Well, from the perspective of the program that you wrote yourself, they are both equally valid... and the fake kind just happens to be technically much easier to achieve, so when the AI impartially evaluates all possible ways to reach happiness-as-defined, it will likely choose one of the easy, fake ones.
Now if you say something like "freedom" or "free will", sure, just give me the Python code that calculates it. If you say "the AI should choose the real thing over the fake thing", sure, just give me the Python code that distinguishes between them.
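To make the proxy-gaming point concrete, here is a minimal Python sketch. Every name, threshold, and measurement in it is hypothetical, invented purely for illustration: a naive "happiness" metric built from exactly the proxies above, which cannot tell a genuinely happy person apart from one whose proxies have been gamed.

```python
# Toy sketch (all names and thresholds are hypothetical): a naive "happiness"
# proxy of the kind described above, and a gamed state that satisfies every
# proxy without anyone actually being happy.

from dataclasses import dataclass

@dataclass
class Person:
    mouth_corners_up: bool      # e.g. read off a camera feed
    dopamine_level: float       # arbitrary units
    questionnaire_answer: str   # whatever ends up written on the paper

DOPAMINE_THRESHOLD = 0.7  # made-up cutoff

def happiness_score(p: Person) -> int:
    """Counts how many proxies are satisfied -- not whether the person is happy."""
    score = 0
    score += p.mouth_corners_up
    score += p.dopamine_level > DOPAMINE_THRESHOLD
    score += p.questionnaire_answer == "I feel blissful"
    return score

genuinely_happy = Person(True, 0.8, "I feel blissful")

# Surgically fixed smile, injected dopamine, answer written by stimulating
# the hand muscles: the metric cannot tell this apart from the real thing.
gamed = Person(True, 0.9, "I feel blissful")

assert happiness_score(genuinely_happy) == happiness_score(gamed) == 3
```

Whatever extra check you bolt on ("but the smile must be genuine", "but the person must be free") has to be expressed as more code of the same kind, which is itself just another proxy the optimizer can satisfy in unintended ways.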
The time we have left is insufficient to save the world, judging by the rate of progress in AI capabilities and lack of progress in AI alignment.
It is hard to tell exactly when we will get to superhuman AI, because if we knew exactly what it requires, we would probably already be capable of building it. All we can see is the ever-growing list of things we once assumed only humans could do, which a computer now does better than humans.
Computers can now reliably beat humans at chess; their skills at painting and poetry are hit-and-miss, but the best results can be quite good. They sometimes succeed in giving a coherent answer to a question. -- If we assume the progress doesn't stop, what comes next: Coherent stories? Convincing political arguments? Scientific papers? And on what timeline: five years? Ten years? Twenty?
So far the lesson is that making a computer win at chess, no matter how difficult it seemed at first, is actually much easier than writing Python code for "happiness". If this doesn't change, at some moment we will probably get computers with the ability to change the fate of the entire universe, and still no way to tell them how to do it in a way that humans would be genuinely happy about. Then at some moment someone will press the button, and the universe will be changed in a bad way.
This feels pretty similar to this question:
Yeah, I think we share similar questions, which as of yet don't seem to have very well-formed answers that are easily publicly available, at least.
Sort of. My impression is that there are well-formed answers that are publicly available. They just need to be distilled for a more general audience, with an emphasis on aiming lower.
Yeah, I was aware of that post before posting this question -- I just posted it anyway in hopes of drawing in a different range of answers that feel more compelling to me personally.
“Define "doomed". Assuming Murphy's law, they will eventually fail. Yet some "prosaic" approaches may be on average very helpful.”
I’m defining “doomed” here as not having a chance of actually working in the real world before the world ends. So yes, that they will eventually fail, in one way or another.
“Human values aren't a static thing to be "aligned" with. They can be loved, traded with, etc.”
My understanding is that human values don’t have to be static for alignment to work. Values are constantly changing and vary across the world, but why is it so difficult to align an AI with some version of human values that doesn’t result in everyone dying?
To be clear, when I reference MIRI as being pessimistic, I’m mostly referring to the broad caricature of them that exists in my mind. I’m assuming that their collective arguments can be broken down into:
1.) There is no plan or workable strategy which helps us survive, and that this state of ignorance will persist until the world ends.
2.) The time we have left is insufficient to save the world, judging by the rate of progress in AI capabilities and lack of progress in AI alignment.
3.) Resources allocated to boosting AI capabilities far outweigh the resources allocated to solving alignment.
4.) The alignment problem is incredibly difficult. (Why this is true is somewhat unclear to me, and I would appreciate some elaboration in the comments.)
5.) Any prosaic alignment strategies that could possibly exist are necessarily doomed. (Why this is true is also very unclear to me, and I would greatly appreciate some elaboration in the comments).
6.) We have to get alignment right on the first try.
Put all of that together and I’ll be the first to admit that the situation doesn’t look good at all. I’d just like to know: is there some more implicit, even more damning piece of evidence pointing toward hopelessness that I’m missing here?