I think a better active-inference-inspired perspective, and one that fits well with the distinction Anna is trying to make here, is to represent preferences as probability distributions over state/observation trajectories: one assigns high probability ("belief in") to trajectories that are more desirable. This "preference distribution" is distinct from the agent's "prediction distribution", which tries to anticipate and explain outcomes as accurately as possible. Active inference is then cast as the process of minimising the KL divergence between these two distributions.
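To make the KL picture concrete, here is a minimal toy sketch in Python. Everything in it is an assumption for illustration: the three outcomes, the policy names, the numbers, and the choice of direction KL[prediction || preference] (the description above doesn't fix a direction). It also collapses trajectories to a single outcome variable and ignores any information-seeking term, so it shows only the "match predictions to preferences" part of the story, not a full active inference agent.

```python
import numpy as np

def kl_divergence(q, p, eps=1e-12):
    """KL[q || p] for discrete distributions given as 1-D arrays."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    mask = q > 0  # terms with q == 0 contribute nothing
    return float(np.sum(q[mask] * (np.log(q[mask]) - np.log(p[mask] + eps))))

# Hypothetical outcomes: "cold", "comfortable", "hot".
# Preference distribution: high probability on the desirable outcome.
preference = np.array([0.05, 0.90, 0.05])

# Hypothetical prediction distributions: what the agent expects to
# observe under each candidate policy (numbers are made up).
predictions = {
    "do_nothing": np.array([0.60, 0.30, 0.10]),
    "turn_on_heater": np.array([0.10, 0.80, 0.10]),
    "open_window": np.array([0.80, 0.15, 0.05]),
}

# Assumed direction: KL[prediction || preference].
divergences = {
    name: kl_divergence(q, preference) for name, q in predictions.items()
}

# Pick the policy whose predicted outcomes best match the preferences.
best_policy = min(divergences, key=divergences.get)
print(best_policy, divergences)
```

In this toy setup the selected policy is the one whose predicted outcomes put most of their mass where the preference distribution does; swapping the KL direction or adding an epistemic term would change the ranking in general.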
A couple of pointers which articulate this idea very nicely in different contexts:
I like these posts! Could you share any pointers or thoughts on how this framework extends to situations where you're not passively observing a stream of information, but your actions affect the incoming bits?
Relatedly, is there a way to capture the idea of performing maximally informative experiments to extract as much structure from the (anticipated) bit stream as possible?
This paper and this one are, to my knowledge, the most recent technical expositions of the FEP. I don't know of any clear derivations of it in the discrete setting.