In some recent discussions I have realized that there is quite a nasty implied disagreement about whether AI alignment is a functional property or not, that is, whether your personal definition of an AI being "aligned" depends purely on its input/output behavior, irrespective of whatever crazy things are going on inside to generate that behavior.
So I'd like to ask the community: is the current mainstream take that 'Alignment' is functional (only the input/output mapping matters), or does the internal computation matter too (it's not OK to think a naughty thought and then have some subroutine cancel it, for example)?
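To make the distinction concrete, here is a toy sketch (purely illustrative, hypothetical names, not any real system) of two "assistants" with identical input/output behavior, where only one of them produces a naughty intermediate thought that a subroutine then cancels:

```python
def direct_assistant(prompt: str) -> str:
    """Produces the helpful answer directly."""
    return f"Helpful answer to: {prompt}"


def suppressing_assistant(prompt: str) -> str:
    """Internally drafts a 'naughty' plan, then a subroutine cancels it."""
    naughty_draft = f"Harmful plan for: {prompt}"  # the naughty thought
    del naughty_draft                              # ...which is discarded
    return f"Helpful answer to: {prompt}"          # same observable output


# A purely functional notion of alignment cannot tell these two apart:
prompt = "how do I bake bread?"
assert direct_assistant(prompt) == suppressing_assistant(prompt)
# An 'internals matter' notion would judge the second one differently,
# even though no behavioral test can distinguish them.
```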
It seems that if there is any non-determinism at all, there will always be an unavoidable potential for naughty thoughts, so whatever you call the "AI" must address them as part of its function anyway; either that, or there is a deterministic solution?