Which source code and machine code are actually being executed on a particular substrate is an empirical fact about the world, so in general, an AI (or a human) might learn it the way we learn any other fact - by making inferences from observations of the world.
For example, maybe you hook up a debugger or a waveform reader to the AI's CPU to get a memory dump, reverse engineer the running code from the memory dump, and then prove that the properties you care about follow inevitably from running the code you reverse engineered.
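As a toy illustration of one small piece of this, here's a minimal Python sketch of checking a memory dump against a reference binary image. The file names, the offset, and the idea that you already have a trusted reference image are all assumptions for the sake of the example; real reverse engineering is far messier (relocations, JIT-compiled code, self-modifying code, and so on).

```python
def verify_code_region(memory_dump: bytes, code_offset: int,
                       reference_image: bytes) -> bool:
    """Check that the executable region of a memory dump matches a
    trusted reference image, byte for byte. `code_offset` is the
    (hypothetical) offset of the code segment within the dump."""
    region = memory_dump[code_offset:code_offset + len(reference_image)]
    return region == reference_image

# Hypothetical usage: "dump.bin" obtained via a debugger or JTAG probe,
# "reference_text.bin" extracted from the binary you expect to be running.
# with open("dump.bin", "rb") as f:
#     dump = f.read()
# with open("reference_text.bin", "rb") as f:
#     reference = f.read()
# verify_code_region(dump, code_offset=0x1000, reference_image=reference)
```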
In general though, this is a pretty hard, unsolved problem - you probably run into a bunch of issues related to embedded agency pretty quickly.
To get an intuition for why these problems might be possible to solve, consider the tools that humans who want to cooperate or verify each other's decision processes might use in more restricted circumstances.
For example, maybe two governments that distrust each other conduct a hostage exchange by meeting in a neutral country, each bringing lots of guns along with extra hostages they don't intend to exchange that day. Each government publicly declares that if the exchange doesn't go well, they'll shoot a bunch of the extra hostages.
Or, suppose two humans are in a True Prisoner's Dilemma with each other, in a world where fMRI technology has advanced to the point where we can read off semantically meaningful inner thoughts from a scan in real time. Both players agree to undergo such a scan and make the results available to their opponent while making their decisions. This might not let them prove that their opponent will cooperate iff they cooperate, but it would probably still make it a lot easier for them to robustly cooperate, for valid decision-theoretic reasons rather than out of honor or a spirit of friendliness.
We might develop schemes for auditable computation, where one party can come in at any time and check the other party's logs. The logs should conform both to the source code the second party is supposed to be running and to any observable behavior the second party has displayed. It's probably possible to make the logging and behavioral signalling rich enough that the first party can become convinced that that code is indeed being run, without the check being too expensive -- maybe with some kind of probabilistically checkable proof.
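To make the spot-checking intuition concrete, here's a toy Python sketch. It is not a real probabilistically checkable proof - just naive random auditing - and it assumes the parties have agreed on a deterministic step function and that the runner logs every intermediate state, which is unrealistically expensive but keeps the example short.

```python
import random

def step(state: int) -> int:
    """Stand-in for one deterministic step of the agreed-upon program
    (here, an arbitrary 64-bit linear congruential update)."""
    return (state * 6364136223846793005 + 1442695040888963407) % 2**64

def run_and_log(initial_state: int, n_steps: int) -> list[int]:
    """The runner executes the program and logs every intermediate state."""
    log = [initial_state]
    for _ in range(n_steps):
        log.append(step(log[-1]))
    return log

def spot_check(log: list[int], samples: int) -> bool:
    """The auditor re-executes a random sample of logged transitions.
    If k of n transitions were falsified, all samples miss them with
    probability at most (1 - k/n)**samples."""
    for _ in range(samples):
        i = random.randrange(len(log) - 1)
        if step(log[i]) != log[i + 1]:
            return False
    return True

log = run_and_log(initial_state=42, n_steps=10_000)
assert spot_check(log, samples=100)
```

The point is the asymmetry: the runner does all n steps of work, while the auditor only re-executes a small sample, yet a substantially falsified log gets caught with high probability.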
However, this only provides a positive proof that certain code is being run, not a negative proof that no other code is being run at the same time. This part, I think, inherently requires knowing something about the other party's computational resources. But if you can know about those, then
it might be possible. For a perhaps dystopian example, if you know your counterparty has compute A, and the program you want them to run takes compute B, then you could demand they do something difficult but easily checkable, like inverting hash functions (sketched below), that soaks up around A - B of their compute, so they have nothing left over to do anything secret with.

Would anyone actually agree to this? In the soaking-up-extra-compute case, I can only really picture it (a) on a very short-term basis, for example while linking up tightly for important negotiations (though even here, it seems unlikely), or (b) in a situation with high power asymmetry. For example, maybe there's a story where 'lords' delegate work to their 'vassals', but the workload intensity is variable, so the vassals have leftover compute, and the lords demand that they spend it on something like blockchain mining. To compensate for the vulnerability this induces, the lords would also provide protection.
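Here's the compute-soaking idea as a toy Python sketch, using partial hash inversion (proof-of-work style, as in Hashcash or Bitcoin). The difficulty constant is a stand-in for "around A - B of compute"; a real scheme would have to calibrate it to the counterparty's actual hardware, which is exactly the hard part.

```python
import hashlib, os

# Expected work is about 2**DIFFICULTY hashes; in a real scheme this would
# be tuned to the compute you want soaked up (an assumption here).
DIFFICULTY = 16

def leading_zero_bits(digest: bytes) -> int:
    """Count the leading zero bits of a hash digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
        else:
            bits += 8 - byte.bit_length()
            break
    return bits

def solve(challenge: bytes) -> int:
    """The counterparty burns compute searching for a valid nonce."""
    nonce = 0
    while True:
        digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
        if leading_zero_bits(digest) >= DIFFICULTY:
            return nonce
        nonce += 1

def verify(challenge: bytes, nonce: int) -> bool:
    """Checking a solution costs one hash, regardless of difficulty."""
    digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
    return leading_zero_bits(digest) >= DIFFICULTY

challenge = os.urandom(16)  # fresh randomness, so the work can't be precomputed
nonce = solve(challenge)
assert verify(challenge, nonce)
```

The asymmetry between solving (exponential in the difficulty) and verifying (one hash) is what makes the demand enforceable: the demanding party can confirm the compute was spent without spending comparable compute itself.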