What source code and what machine code is actually being executed on some particular substrate is an empirical fact about the world, so in general, an AI (or a human) might learn it the way we learn any other fact - by making inferences from observations of the world.
For example, maybe you hook up a debugger or a waveform reader to the AI's CPU to get a memory dump, reverse engineer the running code from the memory dump, and then prove some properties you care about follow inevitably from running the code you reverse engineered.
In general though, this is a pretty hard, unsolved problem - you probably run into a bunch of issues related to embedded agency pretty quickly.
To get an intuition for why these problems might be possible to solve, consider the tools that humans who want to cooperate or verify each other's decision processes might use in more restricted circumstances.
For example, maybe two governments which distrust each other conduct a hostage exchange by meeting in a neutral country, and bring lots of guns and other hostages they intend not to exchange that day. Each government publicly declares that if the hostage exchange doesn't go well, they'll shoot a bunch of the extra hostages.
Or, suppose two humans are in a True Prisoner's Dilemma with each other, in a world where fMRI technology has advanced to the point where we can read off semantically meaningful inner thoughts from a scan in real-time. Both players agree to undergo such a scan and make the scan results available to their opponent, while making their decision. This might not allow them to prove that their opponent will cooperate iff they cooperate, but it will still probably make it a lot easier for them to robustly cooperate, for valid decision theoretic reasons, rather than for reasons of honor or a spirit of friendliness.
Yes but part of what makes interpretability hard is it may be incomputable. And part of what makes making interpretable models difficult to create is that there may be some inner deception you are missing. I think any argument that makes alignment difficult symmetrically makes two agents working together difficult. I mean think about it, if agent 1 can cooperate with agent 2, agent 1 can also bootstrap a version of agent 2 aligned with itself.
Intuitively interpreting an AI feels like we are trying to break some cryptographic hash, but much harder. For example, as humans even interpreting gpt-2 is a monumental task for us despite our relatively vastly superior computational abilities and knowledge. After all, something that’s 2 OOMs superior (gpt-4) could only confidently interpret a fraction of a percent of neurons, and only with certain confidence. It’s likely these are relatively trivial ones too.
Therefore, I think the only way for agent 1 to confirm the motives of agent 2 and eliminate unknown unknowns is to fully simulate it, assuming there isn’t some shortcut to interpretability. And of course fully simulating it requires it to accomplish boxing agent 2, and probably have a large capability edge over it.
I don’t think the TSA analogy is perfect either. There is no (Defect, Cooperate) case. The risks in it are known unknowns. You have a good understanding of what happens if you fail as you just go to jail, but with adversarial agents, they have absolutely no understanding of what happens to them if they are wrong. Being wrong here means you get eliminated. So (D, C)’s negative expectation for example is unbounded. You also have a reasonable expectation of what TSA will do to verify you aren’t breaking the law. Just their methodology of scanning you eliminates a lot of detection methods from the probability space. You have no idea what another agent will do. Maybe they sent an imperfect clone to communicate with you, maybe there’s a hole in the proposed consensus protocol you haven’t considered. The point is this is one giant unknown unknown that can’t be modeled as you are trying to predict what someone with greater capabilities than you can do.
Lastly, I think this chain of reasoning breaks down if there is some robust consensus protocol or some way to make models fully interpretable. However if such a method exists, we would probably just initially align the AGI to begin with as our narrow AI would help us create interpretable AGI, or a robust system of detecting inner deception.