I think the intersection of ethics, moral philosophy, normative descriptions of the current ethics of specific subcultures, etc. is a particularly important topic. I'm also excited that it seems to involve a very different skill set than other important AI alignment tasks, such as interpretability work. I've read some great work that's already been done on ethics alignment, and I'm sure there's more I don't know of, since this isn't my specific area of focus. I think a thorough literature review of ethics alignment would be a valuable contribution to the alignment community.
Example paper: Aligning AI With Shared Human Values
Thanks for the pointer to the paper, saved for later! I think this task of crafting machine-readable representations of human values is a thorny step in any CEV-like/value-loading proposal which doesn't involve the AI inferring them itself IRL-style.
I was considering sifting through the literature to form a model of the ways people have tried to do this in the abstract. For instance, some approaches aim at a fixed normative framework. Others involve an uncertain seed which gets collapsed to the likeliest framework. Others still involve extrapolating from a fixed initial framework to an uncertain distribution over the frameworks it might drift towards. Does this happen to ring a bell about any other references?
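To gesture at what I mean, here's a toy sketch of those three shapes as data structures. Everything in it (the class names, the probability representation) is made up for illustration rather than drawn from any particular paper:

```python
from dataclasses import dataclass

# Purely illustrative sketch of the three shapes described above; the class
# names and the probability representation are my own assumptions, not taken
# from any specific proposal in the literature.

@dataclass
class FixedFramework:
    """Aim directly at one fixed normative framework."""
    framework: str  # e.g. "preference utilitarianism"

@dataclass
class CollapsedSeed:
    """Start from an uncertain seed and collapse it to the likeliest framework."""
    candidates: dict  # framework -> prior probability

    def collapse(self) -> str:
        # Pick the framework with the highest prior mass.
        return max(self.candidates, key=self.candidates.get)

@dataclass
class ExtrapolatedDrift:
    """Extrapolate from a fixed initial framework to a distribution over
    the frameworks it might drift towards."""
    initial: str
    drift: dict  # possible endpoint framework -> probability
```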
This post has been written for the second Refine blog post day, at the end of the first week of iterating on ideas and concretely aiming at the alignment problem.
Thanks to Adam Shimi, Dan Clothiaux, Linda Linsefors, Nora Ammann, TJ, and Ze Shen for discussions which inspired this post.
I currently believe prosaic risk scenarios are the most plausible depictions of us messing up (e.g. Ajeya Cotra's HFDT takeover scenario). That's why I find it exciting to focus on making progress on those in particular, rather than on scenarios involving ideal agents in esoteric circumstances, exotic brain-like architectures, etc. That said, my risk model is very much in its infancy, largely based on bitter lessons from cognitive modeling, expert systems, and other non-ML approaches to learning and reasoning which I briefly explored in undergrad.
In the context of this post, making progress on prosaic risk scenarios means coming up with proposals which decrease the estimated probability of extinction compared to baselines of minimal safety effort. As a person with a background in ML, I find it quite natural to operationalize this goal through benchmarks which proposals can then be tested against. With this approach in mind, I'm excited about pushing the state-of-the-art on prosaic risk scenarios.
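As a rough illustration of what "benchmarking" could mean here, below is a minimal Python sketch, with hypothetical names throughout: a rater estimates P(extinction) for each scenario with and without a proposal applied, and the proposal is scored by its average estimated reduction relative to the minimal-safety baseline.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Scenario:
    name: str
    description: str  # e.g. a training story that ends in takeover

@dataclass
class Proposal:
    name: str
    description: str

# A rater maps (scenario, proposal) to an estimated P(extinction) in [0, 1];
# passing proposal=None stands for the baseline of minimal safety effort.
Rater = Callable[[Scenario, Optional[Proposal]], float]

def score(proposal: Proposal, scenarios: list, rate: Rater) -> float:
    """Average estimated reduction in P(extinction) relative to baseline."""
    deltas = [rate(s, None) - rate(s, proposal) for s in scenarios]
    return sum(deltas) / len(deltas)
```

The point is not the arithmetic but the interface: a proposal gets "benchmarked" by how much it moves a rater's estimates across a fixed suite of scenarios.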
However, one might argue that overfitting solutions to perform well on specific scenarios could lead to brittleness when encountering novel failure modes we never thought of. The proposal wouldn't be robust, it would generalize poorly, something something getting closer to a moon landing by climbing a tree. One way of tackling this issue would be to compile a comprehensive evaluation harness containing scenarios which individually test for broad classes of potential issues (e.g. deception in general, as opposed to one specific deceptive behavior). Another way would be to frame scenarios as hits on the attack surface exposed by a proposal: as scenarios move from concrete details to general failure modes, the associated hit moves from resembling a point to being area-of-effect style. Modeling the attack surface effectively would then help gauge the performance of the proposal.
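To make the point-vs-area-of-effect picture concrete, here's a hypothetical sketch in which the attack surface is flattened into a set of failure-mode tags and each scenario "hits" some subset of them. All the mode and scenario names below are made up for illustration:

```python
# Hypothetical sketch: flatten the attack surface into a set of failure-mode
# tags, and treat each scenario as a "hit" covering some subset of them.
# Concrete scenarios land point hits; general ones are area-of-effect.

ATTACK_SURFACE = {
    "deception", "reward_hacking", "goal_misgeneralization",
    "power_seeking", "oversight_evasion",
}

SCENARIO_HITS = {
    "hfdt_takeover": {"power_seeking", "oversight_evasion"},  # area-of-effect
    "sycophantic_reports": {"deception"},                     # point hit
}

def coverage(hits: dict, surface: set) -> float:
    """Fraction of the attack surface covered by at least one scenario."""
    covered = set().union(*hits.values()) & surface
    return len(covered) / len(surface)

print(coverage(SCENARIO_HITS, ATTACK_SURFACE))  # 0.6
```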
Another concern with using risk scenarios as benchmarks for proposals is the lack of objective metrics. You don't get a clean, highly reproducible accuracy number; you get a messy estimate based on the rater's risk model and their imagination of training stories. I think that's okay, as benchmarks are primarily useful as heuristics for surfacing promising proposals (e.g. think of CNNs on ImageNet circa 2013). This means that testing a batch of proposals against different scenarios to see which ones make a dent might still yield a useful (albeit noisy) prioritization signal, especially in the infamously pre-paradigmatic and confused state of alignment.
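Concretely, the noisy signal could still be aggregated into a coarse ranking. The sketch below (rater names and numbers made up for illustration) averages per-rater estimates of risk reduction and keeps the spread around as a health warning:

```python
import statistics

# Hypothetical sketch: aggregate noisy per-rater estimates of risk reduction
# into a coarse ranking, keeping the spread as a reminder of the noise.

# rater -> proposal -> estimated reduction in P(extinction) vs. baseline
ESTIMATES = {
    "rater_a": {"proposal_1": 0.10, "proposal_2": 0.02},
    "rater_b": {"proposal_1": 0.04, "proposal_2": 0.03},
}

def prioritize(estimates: dict) -> list:
    """Rank proposals by mean estimated risk reduction across raters."""
    proposals = next(iter(estimates.values()))
    ranked = []
    for p in proposals:
        scores = [per_rater[p] for per_rater in estimates.values()]
        ranked.append((p, statistics.mean(scores), statistics.stdev(scores)))
    return sorted(ranked, key=lambda t: t[1], reverse=True)

for name, mean, spread in prioritize(ESTIMATES):
    print(f"{name}: mean reduction {mean:.3f} (spread {spread:.3f})")
```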
The rest of this post consists of testing two proposals against a previous scenario, which is briefly summarized below for convenience. All content is quoted from the tree-shaped website I'm using as a scratchpad throughout Refine.
All in all, the scenarios-as-benchmarks approach offered a relatively concrete structure for navigating the potential and drawbacks of specific proposals. The approach forced me to think through how exactly a proposal would be applied in practice, which I also find quite valuable. On the other hand, benchmarks are by their very nature imperfect, especially given that you can't test the actual thing here. Still, I'm looking forward to trying out this and other such approaches throughout Refine!