The top-rated comment on "AGI Ruin: A List of Lethalities" claims that many other people could've written a list like that.
"Why didn't you challenge anybody else to write up a list like that, if you wanted to make a point of nobody else being able to write it?" I was asked.
Because I don't actually think it does any good or persuades anyone of anything; people don't like tests like that, and I don't really believe in them myself either. I couldn't pass a test somebody else invented around something they found easy to do, for many such possible tests.
But people asked, so, fine, let's actually try it this time. Maybe I'm wrong about how bad things are, and will be pleasantly surprised. If I'm never pleasantly surprised then I'm obviously not being pessimistic enough yet.
So: As part of my current fiction-writing project, I'm writing up a list of some principles that dath ilan's Basement-of-the-World project has invented for describing AGI corrigibility - the sort of principles you'd build into a Bounded Thing meant to carry out some single task or task-class and not destroy the world by doing it.
So far as I know, every principle of this kind, except for Jessica Taylor's "quantilization" and "myopia" (not sure who correctly named this as a corrigibility principle), was invented by me; e.g. "low impact", "shutdownability". (Though I don't think it particularly hopeful if you claim that somebody else has publication priority on "low impact" or whatevs, in some stretched or even nonstretched way; ideas on the level of "low impact" have always seemed cheap to me to propose, and harder to solve before the world ends.)
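For readers who haven't run into "quantilization": roughly, instead of taking the single action with the highest estimated utility, a q-quantilizer samples from a trusted base distribution restricted to its top q fraction of actions by utility. A minimal toy sketch of that idea, assuming made-up action names, weights, and numbers purely for illustration, and glossing over the hard part of where the base distribution comes from:

```python
import random

def quantilize(actions, base_weights, utility, q=0.1):
    """Toy q-quantilizer sketch: sample from the base distribution,
    restricted to the top-q fraction of actions by estimated utility,
    instead of always taking the utility-maximizing action."""
    # Rank candidate actions by estimated utility, best first.
    ranked = sorted(zip(actions, base_weights),
                    key=lambda pair: utility(pair[0]), reverse=True)
    # Keep only the top q fraction (at least one action).
    k = max(1, int(len(ranked) * q))
    top = ranked[:k]
    # Sample among those in proportion to their base-distribution weight,
    # rather than deterministically picking the most extreme action.
    return random.choices([a for a, _ in top],
                          weights=[w for _, w in top], k=1)[0]

# Illustrative usage: the proxy-maximizing "drastic_plan" squeaks into the
# top fraction, but the base distribution mostly favors milder actions.
actions = ["do_nothing", "ask_for_help", "drastic_plan"]
base_weights = [0.5, 0.4, 0.1]
utility = {"do_nothing": 0.0, "ask_for_help": 1.0, "drastic_plan": 1.2}.get
print(quantilize(actions, base_weights, utility, q=0.67))
```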
Some of the items on dath ilan's upcoming list out of my personal glowfic writing have already been written up more seriously by me. Some haven't.
I'm writing this in one afternoon as one tag in my cowritten online novel about a dath ilani who landed in a D&D country run by Hell. One and a half thousand words or so, maybe. (2169 words.)
How about you try to do better than the tag overall, on the topic of corrigibility principles on the level of "myopia" for AGI, before I publish it? It'll get published in a day or so, possibly later, but I'm not going to spend more than an hour or two polishing it.
I interpreted "Fulfilling the task is on the simplest trajectory to non-existence" as something like "the teacher aims to make itself obsolete by preparing the student to one day become the teacher." A good AGI would, in a sense, have a terminal goal of making itself obsolete. That is not to say that it would shut itself off immediately. But it would aim for a future where humanity could "by itself" (I'm going to leave the meaning of that fuzzy for a moment) accomplish everything that humanity previously depended on the AGI for.
Likewise, we would rate human teachers in high school very poorly if either:
1. They immediately killed themselves because they wanted to avoid at all costs doing any harm to their own students.
2. We could tell that most of the teacher's behavior was directed at forever retaining absolute dictatorial power in the classroom and making sure that their own students would never get smart enough to usurp the teacher's place at the head of the class.
We don't want an AGI to immediately shut itself off (or shut itself off before humanity is ready to "fly on its own"), but we also don't want an AGI that has unbounded goals that require it to forever guard its survival.
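To make that tradeoff concrete, here is a purely hypothetical bit of reward shaping (the function and the numbers are invented for illustration, not a proposal for how to actually build this) that penalizes both failure modes above:

```python
def toy_obsolescence_reward(task_done: bool, still_running: bool,
                            steps_after_done: int) -> float:
    """Illustrative only: rewards 'finish the task, then step aside',
    penalizing both premature shutdown and lingering indefinitely."""
    if not task_done and not still_running:
        # Failure mode 1: quit before humanity can "fly on its own".
        return -10.0
    if task_done and still_running:
        # Failure mode 2: hanging around to stay indispensable.
        return 1.0 - 0.1 * steps_after_done
    if task_done and not still_running:
        # Intended trajectory: task fulfilled, then non-existence.
        return 1.0
    # Otherwise: still working on the task.
    return 0.0
```

The point of the sketch is only that "shut down eventually" and "don't shut down prematurely" both have to be expressed somehow; it says nothing about whether an AGI smart enough to matter would actually behave as the shaping intends.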
We have an intuitive notion that a "good" human teacher "should" intrinsically rejoice to see that they have made themselves obsolete. We intuitively applaud when we imagine a scene in a movie, whether it is a martial arts training montage or something like "The Matrix," where the wise mentor character gets to say, "The student has become the teacher."
In our current economic arrangement, this is likely to be more of an ideal than a reality, because we don't currently offer big cash prizes (on the order of an entire career's salary) to teachers for accomplishing this, and any teacher who actually had a superhuman ability to make their own students smarter than themselves, and thus make themselves obsolete, would quickly flood their own job market with even-better replacements. In other words, there are strong incentives against this sort of behavior at the limit.
I have applied this same sort of principle when talking to some of my friends who are communists. I have told them that, as a necessary but not sufficient condition for "avoiding Stalin 2.0" under any future communist government, "the masses" must make sure that there are incentives already in place, before that communist government comes to power, for that government to want to work towards making itself obsolete. That is to say, there must be incentives in place such that, obviously, the communist party doesn't commit mass suicide right out of the gate, but nor does it try to keep itself indispensable to the running of communism once communism has been achieved. If the "state" is going to "wither away" as Marx envisioned, there need to be incentives in place, or a design of the communist party in place, that make that path likely, since, as we know now, that is OBVIOUSLY not the default path for a communist party.
I feel like, if we could figure out an incentive structure or party structure that guaranteed that a communist government would actually "wither away" after accomplishing its tasks, we would have taken a small step towards the larger problem of guaranteeing that an AGI immensely smarter than a communist party would also "wither away" after attaining its goals, rather than try to hold onto power at all costs.