I sometimes aim to align AI with an abstract notion of human values, defined in a way that coincides with my own preferred understanding of the term.
Or sometimes, one meta-level farther up: an abstract notion of human values according to humans' own aggregated understanding, bootstrapped in a dynamic whose starting point was close to my own best understanding of the terms.
Even if the aligned values that should be used for optimization can be defined by something like extrapolated volition, they are not tractable (not ready for immediate use), so an AGI can't be aligned in this sense directly. Optimization toward any other values, or toward even a slightly imperfect approximation of these values, is the essence of AI risk, so an aligned AGI needs to avoid hard optimization altogether until such time as aligned values become tractable (which might never actually happen).
Instead, AGI must be aligned in the sense of aspiring to attain aligned values (or at least cooperating with their instillment), and in the sense of not causing problems, and hopefully being beneficial, in the meantime. AGI alignment is not about alignment of values in the present; it's about creating conditions for eventual alignment of values in the distant future.
What should these values in the distant future be? That's my question here.
The most naive possible answer is that by law any future AI should be designed to be part of human society.
You are completely correct. That is indeed the most naive possible answer. And also the most X, for various values of X, none of them being good things for an answer to be.
Irrespective of technical questions, what values would you align an AGI with?
Or can you point me to work by alignment researchers that formulates these values to some extent?