What source code and what machine code are actually being executed on some particular substrate is an empirical fact about the world, so in general, an AI (or a human) might learn it the way we learn any other fact - by making inferences from observations of the world.
For example, maybe you hook up a debugger or a waveform reader to the AI's CPU to get a memory dump, reverse engineer the running code from that dump, and then prove that the properties you care about follow inevitably from running the code you recovered.
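To give a flavor of what the simplest possible version of that check might look like, here's a toy Python sketch. It assumes you already have a raw dump of the code region and an audited reference binary to compare against; the filenames and the "forbidden opcode" check are hypothetical stand-ins, not a real verification pipeline.

```python
import hashlib

def dump_matches_reference(memory_dump: bytes, reference_binary: bytes) -> bool:
    """Return True if the dumped code region is byte-for-byte the build we audited offline."""
    return hashlib.sha256(memory_dump).digest() == hashlib.sha256(reference_binary).digest()

def contains_raw_syscall(memory_dump: bytes) -> bool:
    """Toy 'property check': scan for the x86-64 `syscall` opcode bytes (0F 05).
    A real check would disassemble the dump properly rather than scanning raw bytes."""
    return b"\x0f\x05" in memory_dump

if __name__ == "__main__":
    # Hypothetical filenames: an audited build, and a dump pulled via the debugger.
    reference = open("audited_agent.bin", "rb").read()
    dump = open("cpu_memory_dump.bin", "rb").read()
    print("matches audited build:", dump_matches_reference(dump, reference))
    print("raw syscall opcode present:", contains_raw_syscall(dump))
```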
In general though, this is a pretty hard, unsolved problem - you probably run into a bunch of issues related to embedded agency pretty quickly.
To get an intuition for why these problems might be possible to solve, consider the tools that humans who want to cooperate or verify each other's decision processes might use in more restricted circumstances.
For example, maybe two governments that distrust each other conduct a hostage exchange by meeting in a neutral country, each bringing lots of guns along with extra hostages they don't intend to exchange that day. Each government publicly declares that if the exchange doesn't go well, it will shoot a bunch of those extra hostages.
Or, suppose two humans are in a True Prisoner's Dilemma with each other, in a world where fMRI technology has advanced to the point where semantically meaningful inner thoughts can be read off a scan in real time. Both players agree to undergo such a scan while making their decision and to make the scan results available to their opponent. This might not allow them to prove that their opponent will cooperate iff they cooperate, but it will probably make it a lot easier for them to robustly cooperate, for valid decision-theoretic reasons rather than out of honor or a spirit of friendliness.
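To make that decision-theoretic point a bit more concrete, here's a toy Python sketch of mutual transparency, loosely in the spirit of the "FairBot" idea from the program-equilibrium literature: each player's decision procedure is handed to its opponent (standing in for the scan), so a player can decide by simulating the opponent. The depth limit is a crude assumption to cut off the regress - this is an illustration, not a model of the fMRI scenario.

```python
# Toy mutual-transparency Prisoner's Dilemma. Each player's decision procedure
# is passed to its opponent, standing in for the real-time fMRI scan.

C, D = "cooperate", "defect"

def cooperate_bot(opponent, depth):
    return C

def defect_bot(opponent, depth):
    return D

def fair_bot(opponent, depth):
    # Cooperate iff simulating the (fully visible) opponent against my own policy
    # predicts cooperation. The depth limit cuts off the regress when two
    # fair_bots inspect each other; at the limit, give the benefit of the doubt.
    if depth <= 0:
        return C
    return C if opponent(fair_bot, depth - 1) == C else D

def play(p1, p2, depth=3):
    return p1(p2, depth), p2(p1, depth)

if __name__ == "__main__":
    print("fair_bot vs fair_bot:  ", play(fair_bot, fair_bot))    # ('cooperate', 'cooperate')
    print("fair_bot vs defect_bot:", play(fair_bot, defect_bot))  # ('defect', 'defect')
```

The point of the toy is that two fair_bots end up cooperating, and a defector gets defected against, purely because each can see and simulate the other's decision procedure - not because anyone is being nice.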
I don't know if that's possible in full generality or with absolute confidence - but then, you can't really know anything with absolute confidence. Here's my layman's understanding / thoughts.
Nevertheless, if one system gives another what it claims is its code + current state, then the receiving system should be able to run the first system in a sandbox on its own hardware, predict its behavior in response to any stimulus, and check those predictions against the real world. If both systems share data this way, each can interact with the sandboxed version of the other, then come together later, each already knowing how its interaction with the other went in simulation. They can test this, and keep testing it, by monitoring the other's behavior over time.
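Here's a minimal sketch of what that check might look like under very strong simplifying assumptions - the other system's "code + current state" is just a deterministic Python function plus a state dict, and "running it in a sandbox" is calling that function on a copy of the state. All of the names are hypothetical.

```python
import copy

def predict(claimed_policy, claimed_state, stimulus):
    """Run the claimed code + state in isolation and return (predicted_response, next_state)."""
    sandbox_state = copy.deepcopy(claimed_state)  # keep the sandbox away from our real state
    return claimed_policy(sandbox_state, stimulus)

def audit(claimed_policy, claimed_state, interaction_log):
    """Replay an observed log of (stimulus, observed_response) pairs and flag every
    step where the real system's behavior diverged from its claimed code + state."""
    state = copy.deepcopy(claimed_state)
    mismatches = []
    for step, (stimulus, observed) in enumerate(interaction_log):
        predicted, state = claimed_policy(state, stimulus)
        if predicted != observed:
            mismatches.append((step, stimulus, predicted, observed))
    return mismatches

# Hypothetical "agent": a deterministic policy that echoes stimuli and counts them.
def toy_policy(state, stimulus):
    state["count"] += 1
    return f"echo:{stimulus}#{state['count']}", state

if __name__ == "__main__":
    observed_log = [("hello", "echo:hello#1"), ("trade?", "echo:trade?#2")]
    # An empty list means the observed behavior matched what the claimed code predicts.
    print(audit(toy_policy, {"count": 0}, observed_log))
```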
Some of my own intuitions about this: