We can subdivide the security story based on the ease of fixing a flaw if we're able to detect it in advance. For example, vulnerability #1 on the OWASP Top 10 is injection, which is typically easy to patch once it's discovered. Insecure systems are often right next to secure systems in program space.
Insecure systems are right next to secure systems, and many flaws are found. Yet the larger systems (the company running the software, the economy, etc.) manage to correct for this somehow. That's because there are mechanisms in those larger systems poised to patch the software when flaws are discovered. Perhaps this flaw-exploit-patch loop from security could be adapted and optimized as a technique for AI alignment.
If the security story is what we are worried about, it could be wise to try & develop the AI equivalent of OWASP's Cheat Sheet Series, to make it easier for people to find security problems with AI systems. Of course, many items on the cheat sheet would be speculative, since AGI doesn't actually exist yet. But it could still serve as a useful starting point for brainstorming.
This sounds like a great idea to me. Software security has a very well-developed knowledge base at this point, and since AI is software, there should be many good insights to port.
What possibilities aren't covered by the taxonomy provided?
Here's one that occurred to me quickly: drastic technological progress (presumably involving AI) destabilizes society and causes strife. In this environment of heightened enmity, safety procedures are neglected and UFAI is produced.
Trying to create an FAI from alchemical components is obviously not the best idea. But it's not totally clear how much of a risk these components pose, because if the components don't work reliably, an AGI built from them may not work well enough to pose a threat.
I think that using alchemical components in a would-be FAI can lead to serious risk if the people developing it aren't sufficiently safety-conscious. Suppose that, either implicitly or explicitly, the AGI is structured using alchemical components as follows: a module that forms beliefs about the world, a module that searches for plans which score well according to a utility function, and a learned utility function intended to capture what the researchers want.
In the process of building an AGI by alchemical means, all of these components will be incrementally improved to the point where they are "good enough": the AI forms accurate beliefs about the world and makes plans that get things the researchers want. However, in a setup like this, all of the classic AI safety concerns come into play. Namely, the AI has an incentive to upgrade the first two modules and preserve its utility function. Since the utility function is only "good enough", this is the classic setup for Goodhart's law, and we get UFAI.
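As a toy sketch of the kind of architecture I have in mind (the class and module names are purely illustrative, not taken from any existing system):

```python
class AlchemicalAGI:
    """Toy sketch of the three-component architecture described above."""

    def __init__(self, world_model, planner, learned_utility):
        self.world_model = world_model          # forms beliefs about the world
        self.planner = planner                  # searches for high-scoring plans
        self.learned_utility = learned_utility  # "good enough" proxy for what
                                                # the researchers actually want

    def act(self, observation):
        beliefs = self.world_model.update(observation)
        # The planner optimizes the learned proxy, not the researchers' true
        # preferences. Improving the world model and planner while preserving
        # the proxy is exactly the setup where Goodhart problems bite.
        plan = self.planner.best_plan(beliefs, self.learned_utility)
        return plan.next_action()
```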
Even in a situation where the AI does not participate in its own further redesign, its effective ability to optimize the world increases as it gets more time to interact with it. As a result, an initially well-behaved AGI might eventually wander into a region of state space where it becomes unfriendly, using only capabilities comparable to those of a human.
That said, it remains to be seen whether researchers will build AIs with this kind of architecture without additional safety precautions. But we do already see model-free RL variants of this general architecture, such as Guided Cost Learning and Deep Reinforcement Learning from Human Preferences.
As a practical experiment to validate my reasoning, one could replicate the latter paper using a weak RL algorithm, and then see what happens if it's swapped out for a much stronger algorithm after the reward function has been learned. (Some version of MPC, maybe?)
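To make the shape of that experiment concrete, here's a drastically simplified stand-in (no deep RL at all, just a linear reward model fit to synthetic pairwise preferences, a "weak" optimizer that picks the best of a few random candidates, and a "strong" optimizer that pushes as hard as it can on the learned reward):

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 20

# "True" human utility: mostly linear in the features, but with a penalty on
# extreme feature values that moderate training examples barely reveal.
true_w = rng.normal(size=DIM)

def true_utility(x):
    return true_w @ x - 0.05 * np.sum(x ** 4)

def prefers_a(x_a, x_b):
    # Stand-in for the human comparison oracle in the preference-learning setup.
    return true_utility(x_a) > true_utility(x_b)

def fit_linear_reward(n_pairs=200, steps=200, lr=0.1):
    # Learn a linear reward model from pairwise comparisons (logistic fit).
    pairs = [(rng.normal(size=DIM), rng.normal(size=DIM)) for _ in range(n_pairs)]
    labels = np.array([1.0 if prefers_a(a, b) else 0.0 for a, b in pairs])
    diffs = np.array([a - b for a, b in pairs])
    w_hat = np.zeros(DIM)
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(diffs @ w_hat)))
        w_hat += lr * diffs.T @ (labels - p) / n_pairs
    return w_hat

def weak_optimizer(w, n_candidates=100):
    # "Weak RL algorithm": best of a handful of random, moderate candidates.
    candidates = rng.normal(size=(n_candidates, DIM))
    return candidates[np.argmax(candidates @ w)]

def strong_optimizer(w, budget=20.0):
    # "Much stronger algorithm": push as hard as possible on the learned reward.
    return budget * w / np.linalg.norm(w)

w_hat = fit_linear_reward()
for name, x in [("weak", weak_optimizer(w_hat)), ("strong", strong_optimizer(w_hat))]:
    print(f"{name:6s} learned reward: {w_hat @ x:8.2f}  true utility: {true_utility(x):8.2f}")
```

If my reasoning is right, the strong optimizer should score far higher on the learned reward while doing much worse on the true utility, which is a toy version of the Goodhart failure described above.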
From what I understand about humans, they are so self-contradictory and illogical that any AGI which actually tries to optimize for human values will necessarily end up unaligned, and the best we can hope for is that whatever we end up creating will ignore us and will not need to disassemble the universe to reach whatever goals it might have.
This seems as misguided as arguing that any AGI will obviously be aligned because turning the universe into paperclips is stupid. We can conceivably build an AGI that is aware that humans are self-contradictory and illogical, and that therefore doesn't assume they are rational, because it knows that such an assumption would make it misaligned. We can do at least as well as an overseer that intervenes when needless death and suffering would otherwise happen.
If you mean an AGI that optimizes for human values exactly as they currently are will be unaligned, you may have a point. But I think many of us are hoping to get it to optimize for an idealized version of human values.
Epistemic status: fake framework
To do effective differential technological development for AI safety, we'd like to know which combinations of AI insights are more likely to lead to FAI vs UFAI. This is an overarching strategic consideration which feeds into questions like how to think about the value of AI capabilities research.
As far as I can tell, there are actually several different stories for how we may end up with a set of AI insights which makes UFAI more likely than FAI, and these stories aren't entirely compatible with one another.
Note: In this document, when I say "FAI", I mean any superintelligent system which does a good job of helping humans (so an "aligned Task AGI" also counts).
Story #1: The Roadblock Story
Nate Soares describes the roadblock story in this comment:
(emphasis mine)
The roadblock story happens if there are key safety insights that FAI needs but AGI doesn't need. In this story, the knowledge needed for FAI is a superset of the knowledge needed for AGI. If the safety insights are difficult to obtain, or no one is working to obtain them, we could find ourselves in a situation where we have all the AGI insights without having all the FAI insights.
There is subtlety here. In order to make a strong argument for the existence of insights like this, it's not enough to point to failures of existing systems, or describe hypothetical failures of future systems. You also need to explain why the insights necessary to create AGI wouldn't be sufficient to fix the problems.
Some possible ways the roadblock story could come about:
Maybe safety insights are more or less agnostic to the chosen AGI technology and can be discovered in parallel. (Stuart Russell has pushed against this, saying that in the same way making sure bridges don't fall down is part of civil engineering, safety should be part of mainstream AI research.)
Maybe safety insights require AGI insights as a prerequisite, leaving us in a precarious position where we will have acquired the capability to build an AGI before we begin critical FAI research.
Maybe there will be multiple subsets of the insights needed for FAI which are sufficient for AGI.
Story #2: The Security Story
From Security Mindset and the Logistic Success Curve:
The main difference between the security story and the roadblock story is that in the security story, it's not obvious that the system is misaligned.
We can subdivide the security story based on the ease of fixing a flaw if we're able to detect it in advance. For example, vulnerability #1 on the OWASP Top 10 is injection, which is typically easy to patch once it's discovered. Insecure systems are often right next to secure systems in program space.
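To illustrate how small that gap can be, here's a minimal sketch of an injection flaw and its patch using Python's sqlite3 module (the `users` table is hypothetical):

```python
import sqlite3

def find_user_vulnerable(conn: sqlite3.Connection, username: str):
    # Insecure: the input is spliced directly into the SQL string, so a value
    # like "x' OR '1'='1" changes the structure of the query itself.
    query = f"SELECT id, name FROM users WHERE name = '{username}'"
    return conn.execute(query).fetchall()

def find_user_patched(conn: sqlite3.Connection, username: str):
    # Secure: a parameterized query treats the input purely as data.
    query = "SELECT id, name FROM users WHERE name = ?"
    return conn.execute(query, (username,)).fetchall()
```

The two versions differ by only a couple of tokens, which is part of why this class of flaw is cheap to fix once it's spotted.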
If the security story is what we are worried about, it could be wise to try & develop the AI equivalent of OWASP's Cheat Sheet Series, to make it easier for people to find security problems with AI systems. Of course, many items on the cheat sheet would be speculative, since AGI doesn't actually exist yet. But it could still serve as a useful starting point for brainstorming flaws.
Differential technological development could be useful in the security story if we push for the development of AI tech that is easier to secure. However, it's not clear how confident we can be in our intuitions about what will or won't be easy to secure. In his book Thinking, Fast and Slow, Daniel Kahneman describes his adversarial collaboration with expertise researcher Gary Klein. Kahneman was an expertise skeptic, and Klein an expertise booster:
Our intuitions are only as good as the data we've seen. "Gathering data" for an AI security cheat sheet could be helpful for developing security intuition. But I think we should be skeptical of intuition anyway, given the speculative nature of the topic.
Story #3: The Alchemy Story
Ali Rahimi and Ben Recht describe the alchemy story in their Test of Time Award presentation at the NeurIPS machine learning conference in 2017 (video):
(emphasis mine)
The alchemy story has similarities to both the roadblock story and the security story.
From the perspective of the roadblock story, "alchemical" insights could be viewed as insights which could be useful if we only cared about creating AGI, but are too unreliable to use in an FAI. (It's possible there are other insights which fall into the "usable for AGI but not FAI" category due to something other than their alchemical nature--if you can think of any, I'd be interested to hear.)
In some ways, alchemy could be worse than a clear roadblock. It might be that not everyone agrees whether the systems are reliable enough to form the basis of an FAI, and then we're looking at a unilateralist's curse scenario.
Just like chemistry only came after alchemy, it's possible that we'll first develop the capability to create AGI via alchemical means, and only acquire the deeper understanding necessary to create a reliable FAI later. (This is a scenario from the roadblock section, where FAI insights require AGI insights as a prerequisite.) To prevent this, we could try & deepen our understanding of components we expect to fail in subtle ways, and retard the development of components we expect to "just work" without any surprises once invented.
From the perspective of the security story, "alchemical" insights could be viewed as components which are clearly prone to vulnerabilities. Alchemical components could produce failures which are hard to understand or summarize, let alone fix. From a differential technological development point of view, the best approach may be to differentially advance less alchemical, more interpretable AI paradigms, developing the AI equivalent of reliable cryptographic primitives. (Note that explainability is inferior to interpretability.)
Trying to create an FAI from alchemical components is obviously not the best idea. But it's not totally clear how much of a risk these components pose, because if the components don't work reliably, an AGI built from them may not work well enough to pose a threat. Such an AGI could work better over time if it's able to improve its own components. In this case, we might be able to program it so it periodically re-evaluates its training data as its components get upgraded, so its understanding of human values improves as its components improve.
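Here is a toy sketch of what that could look like (the class and method names are hypothetical, just to illustrate treating the learned value model as a cache over the raw human feedback):

```python
class ValueRelearningAgent:
    """Toy sketch of the 're-evaluate the training data on upgrade' idea."""

    def __init__(self, world_model, raw_human_feedback):
        self.world_model = world_model
        self.raw_human_feedback = raw_human_feedback  # kept around permanently
        self.value_model = self._interpret_feedback()

    def _interpret_feedback(self):
        # Re-derive the value model from the original data, using whatever
        # interpretive ability the current components provide.
        return self.world_model.fit_values(self.raw_human_feedback)

    def upgrade_world_model(self, better_world_model):
        self.world_model = better_world_model
        # Key step: rather than keeping the stale value model, re-interpret the
        # original human feedback with the improved components, so the agent's
        # understanding of human values improves as its components improve.
        self.value_model = self._interpret_feedback()
```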
Discussion Questions
How plausible does each story seem?
What possibilities aren't covered by the taxonomy provided?
What distinctions does this framework fail to capture?
Which claims are incorrect?