Stephen Casper, scasper@mit.edu. Thanks to Alex Lintz and Daniel Dewey for feedback.
This is a reply to, but not an objection to, a recent post from Paul Christiano titled "AI alignment is distinct from its near term applications." The post is fairly brief, and the key point is decently summed up by this excerpt:
I worry that companies using alignment to help train extremely conservative and inoffensive systems could lead to backlash against the idea of AI alignment itself. If such systems are held up as key successes of alignment, then people who are frustrated with them may end up associating the whole problem of alignment with “making AI systems inoffensive.”
I have no disagreements with this claim. But I would push back against the general notion that AI [existential] safety work is disjoint from near-term applications. Paul seems to agree that it is not:
We can develop and apply alignment techniques to these existing systems. This can help motivate and ground empirical research on alignment, which may end up helping avoid higher-stakes failures like an AI takeover.
This post argues for strongly emphasizing this point.
What do I mean by near-term applications? Any challenging problem involving consequential AI and society. Examples include:
- Self-driving cars
- Recommender systems
- Search engines
- AI weapons
- Cybersecurity
- Unemployment
- Bias/fairness/justice involving systems that work with humans or human data
- Misuses of media generators including text and image generators
I argue that working on these problems probably matters a lot for three reasons, the second and third of which are potential matters of existential safety.
Non X-risks from AI are still intrinsically important AI safety issues.
There are many important non X-risks in this world, and any altruistic-minded person should care about them. For the same reason we should care about health, wealth, development, and animal welfare, we should also care about making important near-term applications of narrow AI go well for people.
There are valuable lessons to learn from near-term applications.
Imagine that we figured out ways to make near-term applications of AI go very well. I find it incredibly hard to imagine a world in which we did any of these things without developing a lot of useful technical tools and governance strategies that could be retooled or built on for higher-stakes problems later. Consider some examples.
- For self-driving cars to go well, we would need to effectively develop and iterate on techniques for robustness, reliability, and anomaly detection/handling.
- To make recommender systems go well, we would need to make a lot more progress on efficiently inferring what humans truly want in a way that is disentangled from what they seem to want.
- To minimize harms from AI weapons, we would need to introduce a lot of national and international laws and precedent for disincentivizing and responding to deadly AI.
- To minimize problems from discriminatory AI or the misuses of media generators (e.g. BLOOM or Stable Diffusion), we would need laws and precedent establishing definitions of harm and methods for recourse. Perhaps most importantly, establishing lots of laws and bureaucracy around AI systems like this may help to establish a legal regime that provides filters and obstacles to the deployment of risky systems. If auditing and slow timelines are good, so is this kind of bureaucracy.
See also this post.
Making allies and growing the AI safety field is useful.
AI safety and longtermism (AIS&L) have a lot of critics, and in the past year or so, those critics seem to have grown in number and profile. Many of them are people who work on and care a lot about near-term applications of AI. To some extent this is inevitable. Having an influential and disruptive agenda will inevitably lead to some pushback from competing ones. Haters are going to hate. Trolls are going to troll.
But AIS&L probably have more detractors than they should among people who should really be allies. Given how many forces in the world are making AI more risky, there shouldn't be conflict between groups of people who are working on making it go better but in different ways. In what world could isolation and mutual dismissal between AIS&L people and people working on neartermist problems be helpful? There seem to be too many common adversaries and interests between the two groups for them not to be allies, especially when it comes to influencing AI governance. Having more friends and fewer detractors seems like it could only increase the political and research capital of the AIS&L community. There is also virtually no downside to being more popular.
I think that some negative press about AIS&L might be due to active or tacit dismissal of the importance of neartermist work by AIS&L people. Speaking for myself, I have had a number of conversations in the past few months with non-AIS&L people who seem sympathetic but have expressed feeling dismissed by the community, which has made them more hesitant to be involved. For this reason, we might stand to benefit a great deal from less parochialism and more friends.
Paul argues that
...companies using alignment to help train extremely conservative and inoffensive systems could lead to backlash against the idea of AI alignment itself.
But I think it is empirically, overwhelmingly clear that a much bigger source of "backlash against the idea of AI alignment itself" is the failure of the AIS&L community to engage with more neartermist work.
Thanks for reading. Constructive feedback is welcome.
One example: a tool designed to detect whether a model is being truthful (correctly representing what it knows about the world to the user), perhaps based on specific properties of truthfulness (for example, see https://www.alignmentforum.org/posts/L4anhrxjv8j2yRKKp/how-discovering-latent-knowledge-in-language-models-without), doesn't seem like it would be useful for improving the reliability of self-driving cars. Self-driving cars likely aren't misaligned in the sense that they could drive perfectly safely but choose not to; rather, they are just unable to drive perfectly safely because some of their internal (learned) systems aren't sufficiently robust.