Imagine aliens on a distant world. They have values very different from humans'. However, their values are also complicated, and they don't exactly know their own values.
Imagine these aliens are doing well at AI alignment. They are just about to boot up a friendly (to them) superintelligence.
Now imagine we get to see all their source code and research notes. How helpful would this be for humans solving alignment?
That would be very helpful; I expect we could relatively easily solve the technical problem if we could read their research notes.
As for goal design: the intuitive approach of "just hardcode your values... [in their full complexity, determining what they 'really refer to' in the true ontology of reality (which includes figuring out the true ontology of reality) in order to specify them, and making sure you really endorse this as your final choice]" is actually not doable when you're as time-pressed as we are; though maybe an alien civilization capable of solving alignment would not be so time-pressed, and could work that out carefully over very many years.
Known alternatives which avoid that hardness, and so are more appealing at least under time pressure, include:
Both of these have the property of being copyable by us, i.e. not working only for the aliens' values.
Agreed, it is natural.
To describe 'limited optimization' in my words: The teacher implements an abstract function whose optimization target is not {the outcome of a system containing a copy of this function}, but {criteria about the isolated function's own output}. The i...
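A minimal toy sketch of that distinction, in Python (every name and scoring rule here is hypothetical, chosen purely for illustration; this is a sketch of the abstract description above, not anyone's actual proposal):

```python
from typing import Callable, List

def limited_optimize(candidates: List[str],
                     output_criteria: Callable[[str], float]) -> str:
    # Target: {criteria about the isolated function's own output}.
    # A candidate's score depends only on the candidate string itself.
    return max(candidates, key=output_criteria)

def unlimited_optimize(candidates: List[str],
                       predicted_outcome: Callable[[str], float]) -> str:
    # Target: {the outcome of a system containing a copy of this function}.
    # A candidate's score depends on a prediction of what happens
    # downstream once the candidate is released into the wider system.
    return max(candidates, key=predicted_outcome)

if __name__ == "__main__":
    answers = ["short answer", "long detailed answer", "persuasive answer"]

    # Limited: judge the text in isolation, e.g. prefer the most concise.
    concise = limited_optimize(answers, lambda s: -len(s))

    # Unlimited: judge a (here trivially faked) model of downstream effect,
    # e.g. how strongly the answer is predicted to steer the reader.
    steering = unlimited_optimize(
        answers, lambda s: 1.0 if "persuasive" in s else 0.0)

    print(concise, "|", steering)
```

The only difference is what the score function is allowed to look at: the output in isolation, versus a model of the output's consequences.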