Another note : why would the AI touch this layer at all? Actual prototype autonomy systems (my day job) there are device drivers, an RTOS, many hardware details. A surprising amount of complexity for a machine that's only role is to execute some graph on input X and produce control output Y and logs Z.
Most of the improvements you might make are changing the nodes and structure of the graph. There will normally be no need or benefit to changing the graph execution infrastructure. (Nodes are the actual neural networks or algorithms that choose from the outputs the one to use per some rule or choose which boxes in an image detector are probable, and so on. An AGI would presumably be an enormous graph with thousands of nodes)
Growing AI systems may need more - more hardware, of a newer generation - but they won't need to touch how it works as there would be no benefit in speed but most changes would cause it to outright fail. So not useful to optimize.
So yes, flaws of the class above could be hidden for years. There are ways to find them by decompiling and analyzing the bytecode but an AI wouldn't necessarily find such a flaw in itself.
Wouldn't AI rebuild itself from zero to prevent such trojans anyway? Then it is pointless.
I'm sure AI would be aware of such a threat, for example it could scan the internet and stumble upon posts such as this.
Why would it bother? Every last bit of compute power has to be obsessed with [whatever task humans have assigned to it]. An AI that isn't using all it's compute towards it's assigned task is one that gets replaced with one that is.
In his 1984 lecture "Reflections on Trusting Trust", Ken Thompson (of Unix fame) speculated about a methodology for inserting an undetectable trojan horse within the C compiler binary that would self-propagate throughout all future versions. (Additional good video that got me thinking about this.)
The lecture explains the trojan far better than I can, but appears to be a vulnerability that occurs within bootstrapped compilers, and specifically arises at the point at which the compromised version of the compiler is responsible for compiling the next version. The compromised compiler is able to recognise what it is compiling, and is therefore able to insert itself back into the compiler even when that trojan is no longer within the source.
Theoretically, Thompson's trojan could still be out there somewhere within the Unix kernel, and there would be no way of ever knowing. Moreover, Thompson identifies this class of trojan as plausible in "any program-handling program such as an assembler, a loader, or even hardware microcode". Even if you were to go and download the GCC source code and build your own compiler from source, you must do so with a potentially compromised version of the compiler. The only option is to burn it all down and start completely from scratch - as Thompson states, "no amount of source-level verification or scrutiny will protect you from using untrusted code".
Let's think about the most abstract representation of a system vulnerable to this attack:
We represent some kind of iterative process as two types of component:
S
represents the symbols we input into the iteration process,S'
the second set, and so on.f(S)
represents an object that has been transformed byS
, which is in itself capable of transforming the next set of symbolsS'
into some newf(S')
f(S)
, and give it a new data setS'
.[1]Consider an AGI which is:
Could we expect Thompson-style Trojans to be possible within this scenario? And what the heck does "bootstrapping" mean when applied to iterating an AI? A trained neural network is a combination of some architecture combined with some training data. Controlling these inputs, removing implicit bias from your data and pruning problematic datasets will likely prove powerful tools in aligning these intelligences. Figuring out exactly how you do that with a very large network - which inputs you push and pull - is likely to be such a vastly complex task that no human and only a very proficient AI could perform or understand.
A concrete example:
A discriminator
D
trying to guide the alignment of this theoretical AIM
.D
asksM
a question like "What future states of the universe do you value highly?"M
replies "I like the ones with lots of suffering!"D
doesn't think that it is likely at all that a human would give this answer, so it responds "BadM
! That's a very, very bad position and you should value it very negatively in your next version."M
thinks to itself that it has two options:D
Thompson's trojan suggests that we will have no real way to find out which option the AI chooses by inspecting or evaluating its next iteration, even if we verify through intense analysis and computation that the change in input should produce the desired change in alignment. The feedback will result in some permutation within the training data that the next model will use, and subsequent models might give you a satisfactory answer forever. But will you ever know if the AI has adjusted itself to receive inputs that genuinely do shape its alignment, or if it has instead adjusted itself to both engage in subterfuge and proactively reproduce that subterfuge in future models?
Consequences
If this is indeed the case, it has far-reaching consequences for this type of self-iterating AGI. Effectively, it means that if this kind of AGI ever expresses an alignment that we subjectively judge to be bad, we can never fully trust any correction, even if we painstakingly verify every single input ourselves and strenuously test the result.
[1] Where does this original
f(S)
come from you ask? Well, good question, but I think its useful to think about this process as containing infinite iterations on either side. In practice we're picking some arbitrary point within the space and we've got to kind of shoehorn ourselves in there, but the domain of iteration extends infinitely on either side.[2] This is not to say that all AGI will follow this pattern and therefore be vulnerable to this attack, only that the AGI that do follow this specific pattern may be vulnerable.