Consider reading this instead.
Here is some obvious advice.
I think a common failure mode when working on AI alignment[1] is failing to focus on the hard parts of the problem first. This is a problem both when generating a research agenda and when working within a specific one. Given a research agenda, there are usually many problems you know how to make progress on. But blindly working on whatever seems tractable is not a good idea.
Let's say we are working on a research agenda that requires solving problems A, B, and C. We know that if we find solutions to A, B, and C, we will solve alignment. However, if we can't solve even one of these subproblems, the agenda is doomed. If C seems like a very hard problem that you are not sure you can solve, it would be a bad idea to flinch away from it and work on problem A instead, just because A seems so much more manageable.
If solving A takes a lot of time and effort, all of that time and effort is wasted if you can't solve C in the end. This is especially worrisome when A has tight feedback loops, such that you constantly feel like you are making progress, or when A is just generally fun to work on.
Of course, it can make sense to work on A first if you expect this to help you solve C, or at least to give you more information about whether C is tractable. The general version of this: when you have a large list of problems to solve, focusing on the ones that will give you information useful for solving many of the others can be very valuable. But even then, you should not lose sight of the hard problems that might block you down the road.
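To make this concrete, here is a toy expected-value sketch in Python. Every number in it is a made-up illustrative assumption, not a claim about any real agenda; the point is only that the payoff of a conjunctive agenda is gated by its least tractable subproblem, so cheap early evidence about C is worth a lot.

```python
# Toy expected-value model with made-up numbers, just to illustrate the
# argument above. Every number here is an illustrative assumption, not a
# claim about any real research agenda.

p_solve = {"A": 0.9, "B": 0.8, "C": 0.1}  # subjective probability of solving each subproblem
value_if_all_solved = 100.0               # payoff only arrives if A, B, *and* C are solved

# The agenda pays off only if every subproblem is solved, so its expected
# value is gated by the least tractable subproblem (here, C).
p_all = 1.0
for p in p_solve.values():
    p_all *= p
expected_value = p_all * value_if_all_solved  # 0.9 * 0.8 * 0.1 * 100 = 7.2

# If you pour effort into A first and only later discover that C is
# unsolvable, that effort bought you nothing. Probing C early caps how much
# work you can sink into a doomed agenda.
effort_on_A = 2.0                                          # person-years, made up
wasted_in_expectation = (1 - p_solve["C"]) * effort_on_A   # 0.9 * 2.0 = 1.8
print(expected_value, wasted_in_expectation)
```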
The takeaway is that these two things are very different:
- Solving A as an instrumental subgoal in order to make progress on C, when C is a potential blocker.
- Avoiding C, because it seems hard, and instead working on A because it seems tractable.
[1] Though I expect this to be a general problem that comes up all over the place.
That is rather my point, yes. A solution to C would be most of the way to a solution to AI alignment, and would also solve management / resource allocation / corruption / many other problems. However, as things stand now, a rather significant fraction of the entire world economy is directed towards mitigating the harms caused by the inability of principals to fully trust agents to act in their best interests. As valuable as a solution to C would be, I see no particular reason to expect one to be possible, and I place an extremely high lower bound on the difficulty of the problem (most salient to me is the multi-trillion-dollar value that could be captured by someone who did robustly solve it).
I wish anyone who decides to tackle C the best of luck, but I expect that the median outcome of such work would be something of no value, and that the 99th-percentile outcome would be something like a clever incentive mechanism which, in a perfect-information world, bounds to zero the extent to which an agent's actions can harm the principal while appearing to be helpful, and which degrades gracefully in a world of imperfect information.
In the meantime, I expect that attacking subproblems A and B will have nonzero value even in worlds where nobody finds a robust solution to subproblem C. This is both because better versions of A and B may be able to shore up an imperfect or partial solution to C, and because, while robust solutions to A/B/C may be one sufficient set of components for solving your overall problem, they may not be the only sufficient set, and having solutions to A and B may let you find alternatives to C.
Whether this is true (in the narrow sense of "the AI kills you because it made an explicit calculation and determined that the optimal course of action was to perform behaviors that result in your death", rather than in the broad sense of "you die for reasons that would still have killed you even in the absence of the particular AI agent we're talking about") is, as far as I know, still an open question, and it seems to me to be one where the preliminary signs point against a misaligned singleton being the way we die.
The crux might be whether you expect a single recursively-self-improving agent to take control of the light cone, or whether you don't expect a future where any individual agent can unilaterally determine the contents of the light cone.
For reference, the CAP theorem is one of those problems that sounds worse in theory than it is in practice: for most purposes, significant partitions are rare, and "when a partition happens, data may not be fully up-to-date, and writes may be dropped or result in conflicts" is usually a good-enough answer in those rare cases.
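To make that concrete, here is a minimal, self-contained sketch (plain Python, not any real database's API) of what the availability-over-consistency tradeoff looks like: each replica keeps accepting writes while partitioned, reads may be stale, and conflicting writes are reconciled with last-write-wins once the partition heals.

```python
# Toy availability-favoring (AP) key-value store. This is not any real
# database's API; it just illustrates "data may not be fully up-to-date,
# and writes may be dropped or result in conflicts" during a partition.
import time

class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}  # key -> (timestamp, value)

    def write(self, key, value):
        # Accept the write locally even if the other replica is unreachable
        # (availability), at the cost of temporary divergence.
        self.data[key] = (time.time_ns(), value)

    def read(self, key):
        # May return stale data while a partition is ongoing.
        entry = self.data.get(key)
        return entry[1] if entry else None

    def merge(self, other):
        # After the partition heals, reconcile with last-write-wins
        # (ties broken by value). Concurrent writes to the same key are
        # "conflicts"; one of them silently loses.
        for key, entry in other.data.items():
            if key not in self.data or self.data[key] < entry:
                self.data[key] = entry

# Two partitioned replicas each accept a write to the same key.
a, b = Replica("a"), Replica("b")
a.write("config", "v1")
b.write("config", "v2")                     # conflicting write on the other side
print(a.read("config"))                     # v1 -- stale from b's point of view
a.merge(b); b.merge(a)                      # partition heals
print(a.read("config"), b.read("config"))   # both converge; a's write was lost
```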