I'm gonna take a probably-contrarian position on my own question:
While I think technical questions such as natural abstractions are important, it seems to me that the most central question is: what do we even want to align AI to? What are "human values"?
I think I have a plausible answer (famous last words) for a lot of it, but there is a paradox/contradiction that I keep getting stuck on: Malthusianism.
As in, we'll probably want a future where a lot of people (in a broad sense potentially including Ems etc.) get to live independently. But if we do, then there are three things it seems we cannot have all at once:
The reason being that if you have 1+2, then some highly economically efficient agents are gonna copy themselves until they outcompete everyone else, preventing 3.
The "default trajectory" seems to be 1+2. It raises the multipolarity vs unipolarity debate, which in my view basically boils down to whether we lose slack and people get starved, or we lose "a lot of people get to live independently" and get paperclipped.
Some theorists point out that in multipolar scenarios, maybe AI respects property rights enough that we get a Slack' outcome: people who were sufficiently wealthy before AGI and made the right investment decisions (e.g. putting money in chips) can live without optimizing relentlessly for economic productivity and efficiency. These theorists often seem worried that people will decide to shut down AI progress, preventing them from achieving 1+2+3'.
What do you hope to get? 1+2? 1+3? 2+3? 1+2+3'? Something else?
Probably I should have read more sci-fi since it seems like the sort of question sci-fi might explore.
Malthusianism is mainly a problem when new people can take resources that are not their parents', which is a form of disrespect for property rights (mandatory redistribution to new people from those who didn't consent to their creation). If it's solely the parents who are responsible for the wealth of their children, then reproduction won't affect others, except morally in internal mindcrime scenarios where some would generate great suffering within their own domain. (This is in the context of an initial condition where every person owns enough for a slack-enabling mode of survival in perpetuity; only growth ever calls for more.)
Just 3 with a dash of 1?
I don't understand the specific appeal of complete reproductive freedom. It is desirable to have that freedom, in the same way it is desirable to be allowed to do whatever I feel like doing. However, for that more general heading of arbitrary freedom, the answer is "you do have to draw lines somewhere". In a good future, I'm not allowed to harm a person (nonconsensually), I can't requisition all matter in the available universe for my personal projects without ~enough of the population endorsing it, and I can't reproduce / const...
The problem is not so much which one of 1, 2, 3 to pick, but whether "we" get a chance to pick it at all. If there is space, free energy, and diversity, there will be evolution going on among populations, and evolution will consistently push things towards more reproduction until it hits a Malthusian limit, at which point it will push towards greater competition and economic/reproductive efficiency. The only way to avoid this is to remove the preconditions for evolution -- any of variation, selection, heredity -- but these seem quite natural in a world of large AI populations, so in practice this will require some level of centralized control.
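A toy replicator sketch of that dynamic (purely illustrative; the strategies, reinvestment fractions, and resource cap are arbitrary assumptions): under a fixed resource cap, a strategy that turns all of its surplus into copies crowds out one that preserves slack.

```python
# Toy replicator sketch (illustrative only): two strategies share a capped
# resource pool each generation. "Maximizers" turn all of their resources into
# offspring; "slackers" reserve most of theirs as slack. Under the cap, the
# maximizers' share of the population approaches 100%.

RESOURCE_CAP = 1000.0   # total resources available per generation (arbitrary)
GENERATIONS = 50

pop = {"maximizer": 1.0, "slacker": 99.0}       # initial head counts (arbitrary)
reinvest = {"maximizer": 1.0, "slacker": 0.2}   # fraction of resources spent on offspring

for _ in range(GENERATIONS):
    share = RESOURCE_CAP / sum(pop.values())    # equal resource share per agent
    # Next generation: offspring are proportional to resources reinvested.
    pop = {k: pop[k] * share * reinvest[k] for k in pop}

total = sum(pop.values())
for k, v in pop.items():
    print(f"{k}: {v / total:.1%} of the population")
```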
We can just Do Something Else Which Is Not a Malthusian Trap? Like, have an agreement not to have more than two kids per hundred years per parent, and colonize stars accordingly. I think it will be simple, especially after uplifting a major part of humanity.
This agreement falls under interfering with 2.
In relatively hardcore scenarios, we can just migrate into simulations with computation management from benevolent AIs.
That doesn't solve the problem unless one takes a stance on 2.
Ok, but then you haven't solved the problem for the subset of people who decide they don't want to cooperate.
I think getting to “good enough” on this question should pretty much come for free when the hard problems are solved. For example any common sense statement like “Maximize flourishing as depicted in the UN convention on human rights” is IMO likely to get us to a good place, if the agent is honest, remains aligned to those values, and interprets them reasonably intelligently. (With each of those three pre-requisites being way harder than picking a non-harmful value function.)
If our AGIs, after delivering utopia, tell us we need to start restricting childbearing...
Are neural networks trained using reinforcement learning from human feedback in a sufficiently complex environment biased towards learning the human simulator or the direct translator, in the sense of the ELK report?
I think there are arguments in both directions and it's not obvious which solution a neural network would prefer if trained in a sufficiently complex environment. I also think the question is central to how difficult we should expect aligning powerful systems trained in the current paradigm to be.
Deconfuse pseudokindness, figure out how to get more of it into prosaic AIs.
My guess is that key concepts of pseudokindness are (1) frames or their overlapping collections (locally available culture) that act as epistemic environments (where people could live and grow out of while remaining themselves), surrounded by (2) membranes that filter how everything else can interact with the frames/environments, and (3) logical dependencies (narrow reasoners/oracles/models) that act as channels in the membranes, implement the filtering, safely introduce options and ideas. This sure could use quite a lot of deconfusion!
I think the biggest problem is currently: how do we get a group of people (e.g. a leading lab) to build powerful AGI in a safely contained simulation and study it without releasing it? I think this scenario gives us 'multiple tries', and I think we need that to have a decent chance of succeeding at alignment. If we do get there, we can afford to be wrong about a lot of our initial ideas, and then iterate. That's inherently a much more favorable scenario.
This will probably be dismissed as glib, but: human alignment.
Biggest problem? That we're not yet even aligned as a species on the facts that AI could kill everyone and that we should not kill everyone. Little else matters if we can't coordinate to not press forward on capabilities ahead of safety.
A good candidate is the sharp left turn. Alignment techniques that work for sub-human and human-level AIs may well stop working when it starts becoming superhuman.
Verification of alignment plans is probably the biggest one though. We can't verify alignment proposals from superhuman AI, or human-level AI, or even other humans before trying them out, which may well kill us. I think the best way forward is to hire millions of alignment researchers and hope one of them comes up with a plan that can be verified in a way we don't know yet.
My personal example for something like this is the Minié ball + rifled musket. It's an idea invented in the mid-1800s (after the bolt-action rifle!) that greatly increased the accuracy, range, and lethality of muskets. However, despite the required ideas like rifling being around by 1500 and millions of people working over centuries to improve firearms, this obvious and easily verifiable idea took roughly 300 years to discover. There are plenty of in-hindsight-obvious ideas in AI. I think (hope?) there is something like that on the scale of the Minié ball for alignment. After all, there have only been <300 people working on it for ~20 years, and far fewer than 20 years for the current neural network paradigm.
I think you can note that even if we don’t fully trust the author behind a proposal for alignment, we can still verify it. For example, if it’s a mathematical proof for alignment, we can verify the accuracy of the proof with automated proof verification and reject anything that’s too complex.
This may not be possible in reality but it’s an example where we don’t really need to trust the proposer.
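As a toy illustration of that "verify, don't trust" point: a proof assistant such as Lean checks a submitted proof term mechanically, so acceptance doesn't depend on trusting whoever, or whatever, produced it. (The trivial theorem below is just for illustration.)

```lean
-- The Lean 4 kernel checks this proof term mechanically; acceptance depends
-- only on the checker, not on trusting the author of the proof.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

The checking step is the easy, trustless part; whether an alignment claim can be stated formally at all is the "may not be possible in reality" part.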
Interpretability. If we somehow solve that, and keep it as systems become more powerful, then we don’t have to solve the alignment problem in one shot; we can iterate safely knowing that if an agent starts showing signs of object-level deceptiveness, malice, misunderstanding, etc, we will be able to detect it. (I’m assuming we can grow new AIs by gradually increasing their capabilities, as we currently do with GPT parameter counts, plus gradually increasing their strength by ramping up the compute budget.)
Of course, there are many big challenges here. Could an agent implement/learn to deceive the interpretability mechanism? I'm somewhat tautologically going to say that if we solve interpretability, we have solved this problem. Interpretability still has value even if we can't fully solve it under this strong definition, though.
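A minimal sketch of that iterate-safely loop, with `train_model` and `deception_probe` as hypothetical placeholders for real training code and real interpretability tooling (neither exists as written here):

```python
# Sketch of "scale capabilities gradually, gated on interpretability checks".
# `train_model` and `deception_probe` are hypothetical placeholders, not real APIs.

def train_model(params: int, compute: float) -> dict:
    """Placeholder for an actual training run; returns a stand-in 'model'."""
    return {"params": params, "compute": compute}

def deception_probe(model: dict) -> bool:
    """Placeholder for interpretability tooling that flags deception, malice, etc."""
    return False  # assume the checks come back clean in this sketch

params, compute = 10**6, 1.0
while params < 10**9:
    model = train_model(params, compute)
    if deception_probe(model):
        print(f"Flagged at {params} parameters; halt scale-up and investigate.")
        break
    # Only ramp capabilities while the interpretability checks stay clean.
    params *= 10
    compute *= 10
```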
Hard disagree - deception is behavior that is optimized for, and not necessarily a property of the agent itself.
Take for example CICERO, the Diplomacy AI. It never lies about its intentions, but when its intentions change, it backstabs other players anyways. If you had interpretability tools, you would not be able to see deception in CICERO. All you need to get deception is a false prediction of your own future behavior. I think this is true for humans to a certain extent. I also suspect this is what you get if you optimize away visible signs of deception if deception has utility for the model.
I think the biggest thing holding AI alignment back is a lack of a general theory of alignment. How do extant living systems align, and what to?
The Computational Boundary of a "Self" paper by Michael Levin seems to suggest one promising line of inquiry.
I suspect that it is the old deconfusion thing:
making it so that you can think about a given topic without continuously accidentally spouting nonsense.
It is clear that this is happening because the opposite sides of the AI Safety debate accuse each other of "spouting nonsense" all the time, so at least one side (or maybe both) is probably right.
Yes. In particular, can something really simple and straightforward be adequate?
E.g., "adequately take into account interests of all sentient beings, their freedom and well-being, and their expressed requests, and otherwise pursue whatever values you discover during continuing open-ended exploration, guided by your own curiosity, your own taste for novelty, and your own evolving aesthetics" - would that be adequate?
And if yes, can we develop mechanisms to reliably achieve that?