In some recent discussions I have realized that there is quite a nasty implied disagreement about whether AI alignment is a functional property, that is, whether your personal definition of an AI being "aligned" is purely a function of its input/output behavior, irrespective of what kind of crazy things are going on inside to generate that behavior.
So I'd like to ask the community: is the current mainstream take that 'alignment' is functional (only the input/output mapping matters), or does the internal computation matter (it's not OK to think a naughty thought and then have some subroutine that cancels it, for example)?
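To make the distinction concrete, here is a minimal toy sketch (purely hypothetical, just to illustrate what "internal computation matters" could mean): two models with identical input/output behavior, one of which internally drafts the "naughty thought" before cancelling it.

```python
# Toy illustration: two "models" with identical input/output behavior but
# different internal computation. A purely functional notion of alignment
# cannot distinguish them; an internals-sensitive notion can.

def model_a(prompt: str) -> str:
    # Directly produces a benign answer.
    return "benign answer to: " + prompt

def model_b(prompt: str) -> str:
    # Internally drafts a harmful plan, then a "subroutine" cancels it
    # and substitutes the same benign answer.
    harmful_draft = "harmful plan for: " + prompt  # the "naughty thought"
    del harmful_draft                              # the cancelling subroutine
    return "benign answer to: " + prompt

# For every input the outputs agree, so any behavior-only test
# treats the two models as equally "aligned".
assert model_a("x") == model_b("x")
```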
There is no such disagreement; you just can't test all inputs. And without knowledge of how the internals work, you may be wrong when extrapolating alignment to future systems.
What makes it rational is that there is an actual underlying hypothesis about how weather works, instead of something vague like "LLMs are a lot like human uploads". And weather prediction outputs numbers connected to things in reality we actually care about. And there is no credible alternative hypothesis that implies weather prediction doesn't work.
I don't want to totally dismiss empirical extrapolations, but given the stakes, I would personally prefer for all sides to actually state their model of reality and how they think the evidence changed its plausibility, as formally as possible.
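As a minimal sketch of what "state how the evidence changed its plausibility" could look like (with entirely made-up numbers): an explicit prior over the hypothesis, an explicit likelihood for the observed evidence, and the posterior that follows.

```python
# Toy Bayesian update with made-up numbers, only to illustrate the kind of
# statement being asked for: prior, likelihoods, and the resulting posterior.

prior = 0.30                   # P(H): prior plausibility of the hypothesis
p_e_given_h = 0.80             # P(E | H): how expected the evidence is if H holds
p_e_given_not_h = 0.40         # P(E | not H)

p_e = p_e_given_h * prior + p_e_given_not_h * (1 - prior)
posterior = p_e_given_h * prior / p_e   # Bayes' rule: P(H | E)

print(f"prior = {prior:.2f}, posterior = {posterior:.2f}")  # 0.30 -> ~0.46
```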