In some recent discussions I have realized that there is quite a nasty implied disagreement about whether AI alignment is a functional property or not: that is, whether your personal definition of an "aligned" AI is purely a function of its input/output behavior, irrespective of whatever crazy things are going on inside to generate that behavior.
So I'd like to ask the community: is the current mainstream take that 'alignment' is functional (only the input/output mapping matters), or does the internal computation matter too (it's not OK to think a naughty thought and then have some subroutine cancel it, for example)?
It may be that generating horrible counterfactual lines of thought for the purpose of rejecting them is necessary for getting better outcomes. To the extent that there is a real dichotomy here, I would say that the input/output mapping is the thing that matters: I want humans to not end up worse off for having invented AI.
That said, humans may end up worse off by our own metrics if we build an AI that is itself suffering terribly as part of its internal computation, or is running ancestor torture simulations, or something similar. Technically that is an alignment issue, although I worry that most humans won't care whether the AI is suffering so long as they don't have to watch it suffer and it generates outputs they like, that hidden detail aside.