In some recent discussions I have realized that there is quite a nasty implied disagreement about whether AI alignment is a functional property or not: that is, whether your personal definition of an AI being "aligned" is purely a function of its input/output behavior, irrespective of whatever crazy things are going on inside to generate that behavior.
So I'd like to ask the community whether the current mainstream take is that 'Alignment' is functional (only the input/output mapping matters) or whether the internal computation matters too (it's not OK, for example, to think a naughty thought and then have some subroutine that cancels it).
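To make the distinction concrete, here is a toy Python sketch (entirely hypothetical, not any real model or API): two "models" with the identical input/output mapping, one of which internally generates the naughty thought and then cancels it. A purely functional definition of alignment cannot tell them apart.

```python
# Toy illustration (hypothetical): two "models" with the same I/O behavior
# but different internals. A functional definition of alignment treats them
# identically; an internals-sensitive definition might not.

def model_a(prompt: str) -> str:
    """Never generates the harmful content at all."""
    if "harmful" in prompt:
        return "I can't help with that."
    return "Sure, here you go."

def model_b(prompt: str) -> str:
    """Internally generates the harmful plan, then a subroutine cancels it."""
    if "harmful" in prompt:
        naughty_thought = "step-by-step harmful plan..."  # the 'naughty thought'
        del naughty_thought                               # cancelled before output
        return "I can't help with that."
    return "Sure, here you go."

# Functionally indistinguishable:
assert model_a("harmful request") == model_b("harmful request")
assert model_a("benign request") == model_b("benign request")
```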
What it says: irrelevant
How it thinks: irrelevant
It has always been about what it can do in the real world.
If it can generate substantial amounts of money and buy server capacity, or hack into computer systems, then we get cyberlife: autonomous, rogue, self-sufficient AI, subject to Darwinian forces on the internet that select for more of those qualities, improving its online fitness all the way to a full-blown takeover.
In a world where mechinterp is not 100% reliable, the answer is logically that input/output is what matters.
We won't be able to read the model's thoughts anyway, so why base our judgment on them?
But see my comment above on why survival fitness in cyberspace is the one axis along which most of the relevant input/output will be generated.