Thanks for checking! I think the main difference is that you used data from the Metaculus prediction, whereas I used the Metaculus postdiction, which "uses data from all other questions to calibrate its result, even questions that resolved later." Right now, this gives Metaculus an average log score of 0.519 vs. the community's 0.419 on 885 binary questions, and 2.43 vs. 2.25 on 537 continuous questions, all evaluated at resolve time.
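For anyone who wants to reproduce this kind of comparison, here is a minimal sketch of how an average log score over binary questions could be computed. It assumes the score for a single question is log2(p) + 1, where p is the probability given to the outcome that actually happened, so a 50% forecast scores 0 and a certain, correct forecast scores 1. That convention is my assumption and is not stated above; the function names and the toy forecasts are hypothetical and are not the real question data.

```python
import math

def binary_log_score(p_yes: float, resolved_yes: bool) -> float:
    """Log score for one binary question (assumed convention: log2(p) + 1)."""
    # Probability that the forecast assigned to the outcome that occurred.
    p = p_yes if resolved_yes else 1.0 - p_yes
    return math.log2(p) + 1.0

def average_log_score(forecasts) -> float:
    """Mean log score over (p_yes, resolved_yes) pairs taken at resolve time."""
    scores = [binary_log_score(p, resolved) for p, resolved in forecasts]
    return sum(scores) / len(scores)

# Hypothetical toy data: (probability of "yes" at resolve time, did it resolve "yes"?)
toy_forecasts = [(0.9, True), (0.2, False), (0.6, True)]
print(average_log_score(toy_forecasts))
```

Running the same function on two forecasters' final predictions for the same set of questions gives directly comparable averages; higher is better.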
Thanks for the comment, and I enjoyed reading the article! I basically agree with what you said, and I admit that I only touched briefly on this important "multi-level interests problem" in the "domestic audience" section. I think it would depend a lot on (1) how widely diffused those war-relevant prediction services are and (2) the distribution of societal trust in them (e.g. whether they become politicized). Both would be country- and context-specific, and I did not come up with a useful way to disentangle them further at a general level.