Some people believe that the best approach is to define "safe alignment towards any goal" as the first step, and then insert the "goal" once this is done. I tend to think that this doesn't make sense, and we need to build the values up as we define alignment.
How would you categorize corrigibility in this framework? To me, finding ways to imbue powerful AI systems with principles of corrigibility looks like a promising "incremental" approach. The end goal of such an approach might be to use a powerful but corrigible system pivotally, but many of the properties seem valuable for safety and flourishing in other contexts.
Eliezer!corrigibility is very much not about making the corrigible system understand or maximize human values, for most values other than "not being disassembled for our atoms prematurely". But it still feels more like a positive, incremental approach to building safe systems.
Having done a lot of work on corrigibility, I believe that it can't be implemented in a value-agnostic way; it needs a subset of human values to make sense. I also believe that it requires a lot of human values, which is almost equivalent to solving all of alignment; but this second belief is much less firm, and less widely shared.
I think I disagree. Based on your presentation here, I think someone following a policy inspired by this post would be more likely to cause existential catastrophe by pursuing a seemingly promising false positive that actually destroys all future value in our Hubble volume. I've argued that we need to focus on minimizing false-positive risk rather than optimizing for maximum expected value, which is what I read this post as proposing.
I think any plan that casually includes "solve all of human morality" as a key step is not one that makes sense to even entertain. Focusing on the hypothetical positives we would get if we somehow cleared, in a few years and while racing against AI capabilities, a hurdle that all of humanity has failed to clear for millennia is roughly as good as looking forward to the Rapture. If you seek aligned AGI - which is already very difficult - you seek it for yourself and your own, at the expense of some other group of humans whose values are at odds with yours. If your own values are sensible enough, that other group might still not have it too bad. If you care about humans as a whole staying the centre-stage actor, then I think the only solution is to not build AGI at all.
Well, there has never been a proof that there is a single set of human values, or that human morality is a single entity.
And value alignment isn't synonymous with safety.
I think there are some reasonable shared values, but unless you keep things hopelessly vague (which I don't think you can do without creating other problems), you're sure to run into contradictions.
And yeah, some sets of values, even if seemingly benign, will have unintended dangerous consequences in the hands of a superintelligent being. Not even solving human morality (which is already essentially impossible) would be guaranteed to suffice.
tl;dr: To avoid extinction, focus more on the positive opportunities for the future.
AI is an existential risk and an extinction risk - there is a non-negligible probability that AI/AGI[1] will become super-powered, and a non-negligible probability that such an entity would spell doom for humanity.
Worst of all, AGI is not like other existential risks - an unaligned AGI would be an intelligent adversary, so it's not self-limiting in ways that pandemics or most other disasters are. There is no such thing as a "less vulnerable population" where dangerous AGIs are concerned - all humans are vulnerable if the AGI gets powerful enough, and it would be motivated to become powerful enough. Similarly, we can't expect human survivors to reorganise and adapt to the threat: in the most likely scenario, after an initial AGI attack, the AGI would grow stronger while humanity would grow ever more vulnerable.
To cap it all, solving superintelligent AGI alignment requires that we solve all of human morality, either directly or indirectly (or at least solve it to an adequate level). If we write a goal structure that doesn't include a key part of humanity - such as, say, conscious experiences - then that part will be excised by the forward-planning AGI.
Look to the bright side
Given all that, it's natural and tempting to focus on the existential risk: to spend most of our time looking at doom and how to prevent it. It's a very comfortable place to be: (almost) everyone agrees that doom is bad, so there's little to disagree about. It allows us to avoid uncomfortable issues around power distribution and politics: we don't have to care which human monkey wields the big stick of AGI, we just want to avoid extinction for everyone. It makes AGI alignment into a technical issue that we can solve as dispassionate technicians.
But things are not so comfortable. Remember that AGI alignment is not like other existential risks. We cannot construct a super-powered AGI that is simply "not an existential risk". We have to do more; remember the line, above, about solving all of human morality. We have to define human flourishing if we want humanity to flourish. We have to draw a circle around what counts as human or sentient, if we want anything human or sentient to continue to exist. What would the post-AGI political, legal, and international systems look like? Well, we don't know, but the pre-AGI choices we make will determine what they will be.
A popular idea is to delegate this issue to AGI in some way; see coherent extrapolated volition, which is an underdefined-but-wonderful-sounding approach that doesn't require spelling out exactly what the future would be like. Other people have suggested leaving moral philosophy to the AGI, so it would figure out the ideal outcome. But moral philosophy has never been a simple extension of basic principles; it's an interplay between principles and their consequences in the world, with the philosopher often doing violence to the principles to make them fit with their preferred outcome. For this to work, we need to determine what the AGI should do when it encounters a new moral dilemma. And the way it resolves this dilemma determines the future of the world - we have to make important choices in this area. And, in general, it is better for us to know the consequences of choices before making them. So we have to have opinions and ideas about the post-AGI world.
Of course, this doesn't mean determining the future of humanity and sentience in exhaustive detail. That would be counter-productive, and crippling for our future selves. But it does mean establishing something of the basic framework in which future entities will operate (there is some similarity with designing a constitution for a state). It's not a comfortable place for nerds like me to be in: it involves making choices on issues of equity, freedom, democracy, cooperation, and governance, choices that expose us to objections from others and drag us into the mind-killing mess of politics. But we can't put that off forever. Choosing not to engage with these issues is itself a choice: it abdicates decision-making to others[2] (a good example of this kind of planning can be found in the Fun Sequences).
In compensation, if we do engage with positively designing the future, then a whole world of new techniques opens up to us. It may be possible to build alignment incrementally - to make systems that are more and more aligned (or, as I find more likely, that are aligned in more and more situations). This kind of approach requires a target to aim at: an idea of what positive outcomes we want. It also allows us to engage with researchers who have short-term safety objectives and find ways to extend those objectives, and to take current or future designs for AI cooperation and build upon them.
Note that there is a difference of philosophical approach here. Some people believe that the best approach is to define "safe alignment towards any goal" as the first step, and then insert the "goal" once this is done. I tend to think that this doesn't make sense, and we need to build the values up as we define alignment. I might be wrong; still, if you don't plan more for the ultimate outcome, then you're restricted to exploring only the first approach, and you have to trust that the "goal" will work once it's inserted. Similarly, incremental progress towards AGI cooperation may or may not be possible. If it's possible, we can nudge organisations to be more cooperative or more aligned, though that requires some theory of cooperation to aim for. If it's impossible, then we can only survive via a dramatic pivotal act. Before leaping to that conclusion, let's explore whether the targeted incremental approach can work.
We steer towards what we look at. And we've been spending too much time looking at the disasters. To survive AGI, we need to flourish with AGI. And so we should focus on the flourishing, aim consciously towards that target.
The term "AI" has been so overused by marketers and journalists today that I'll use "AGI" instead.
Of course, we shouldn't engage in political fights more than we strictly have to - politics remains pretty mind-killing in most ways.