Hi there, my background is in AI research and recently I have discovered some AI Alignment communities centered around here. The more I read about AI Alignment, the more I have a feeling that the whole field is basically a fictional-world-building exercise.
Some problems I have noticed: The basic concepts (e.g. the basic properties of the AI being discussed) are left undefined. The questions answered are built on unrealistic premises about how AI systems might work. Mathiness: using vaguely defined mathematical terms to describe complex problems and then solving them with additional vaguely defined mathematical operations. A combination of mathematical thinking and hand-wavy reasoning that leads to preferred conclusions.
Maybe I am reading it wrong. How would you steelman the argument that AI Alignment is actually a rigorous field? Do you consider AI Alignment to be scientific? If so, how is it Popper-falsifiable?
Yeah, but also this is the sort of response that goes better with citations.
Like, people used to make a somewhat hand-wavy argument that AIs trained on goal X might become consequentialists that pursue goal Y, and gave the analogy of the time when humans 'woke up' inside of evolution, and now optimize for goals different from evolution's goals, despite having 'perfect training' in some sense (and the ability to notice the existence of evolution, and its goals). Then eventually someone wrote Risks from Learned Optimization in Advanced Machine Learning Systems in a way that I think involves substantially less hand-waving and substantially more specification in detail.
Of course there are still parts that remain to be specified in detail--either because no one has written it up yet (Risks from Learned Optimization came from, in part, someone relatively new to the field saying "I don't think this hand-wavy argument checks out", looking into it a bunch, being convinced, and then writing it up in detail), or because we don't know what we're looking for yet. (We have a somewhat formal definition of 'corrigibility', but is it the thing that we actually want in our AI designs? It's not yet clear.)