I don't know how or when to use a chi-squared test. What I did was assume - for the sake of checking my intuition - that the two sets of frequencies were indeed not made up.
It's the usual go-to frequentist test for comparing two sets of categorical data. Say you have 4 categories with 10/4/9/3 members. Your null hypothesis is that the new data come from the same distribution, and you're interested in how often, assuming the null, results as extreme or more extreme than your new data of 200/80/150/20 would appear. Like rolling a biased 4-sided die.
(If you're curious, that specific made-up example would be chisq.test(matrix(c(10,4,9,3,200,80,150,20), ncol = 2)), with a p-value of about 0.4.)
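For anyone who would rather see the arithmetic than trust a canned routine, here is a rough Python sketch of the same Pearson test on the same made-up counts. The function names are my own. The tail formula is a closed form that only holds for exactly 3 degrees of freedom, which is what a 4x2 table gives. No continuity correction is applied, on the understanding that R only applies one to 2x2 tables.

```python
import math

def chi2_independence(table):
    """Pearson chi-squared statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under the null: row share times column total.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

def chi2_sf_3dof(x):
    """Upper tail P(X >= x) for a chi-squared variable with exactly 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

old = [10, 4, 9, 3]          # made-up earlier counts from the example
new = [200, 80, 150, 20]     # made-up later counts from the example
table = [list(pair) for pair in zip(old, new)]  # 4 rows (categories) x 2 columns (datasets)

stat, dof = chi2_independence(table)
p_value = chi2_sf_3dof(stat)
print(f"X-squared = {stat:.3f}, df = {dof}, p = {p_value:.3f}")
```

Note that one expected count here is well under 5 (the last category in the small sample), so the chi-squared approximation is shaky and the p-value should be read as a rough guide only.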
The 1995 "study" has a sample size of $37Bn - this in fact turns out to match estimates of the entire DoD spend on IT projects in that year. So if these numbers are correct, then the frequencies must be precisely the probabilities for any given project to fall into the buckets A, B, C, D or E. What I did next was work out some reasonable assumptions for the 1979 set of frequencies. It is drawn from a sample of 9 projects totaling $6.5M, so the mean project cost in the sample is $755K, and knowing a few other facts we can compute a lower bound for the standard deviation of the sample.
This seems like a really weird procedure. You should be looking at the frequencies of each of the categories, not messing around with means and standard deviations. (I mean heck, what about 2 decades of inflation or military growth or cutbacks?) What, you think that the 1995 data implies that the Pentagon had $37bn/$755K=49006 different projects?
I don't know Python or NumPy and your formatting is messed up, so I'm not sure what exactly you're doing. (One nice thing about using precanned routines like R's chisq.test: at least it's relatively clear what you're doing.)
But we also find an earlier (1979) study, with a more credible primary source. Its five categories are labeled exactly the same, its sample size is much smaller - 9 projects for $7 million total. The allocation is nearly the same: A:47%, B: 29%, C: 19%, D: 3%, E: 2%.
Looking closer, I'm not sure this data makes sense. 0.02 * 9 is... 0.18. Not a whole number. 47% * 9 is 4.23. Also not a positive integer or zero. 0.29 * 9 is 2.61.
Sure, the percentages do sum to 100%, but D and E aren't even possible: 1/9 = 11%!
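The same check can be done mechanically. This is a small Python sketch that assumes the percentages are shares of the 9 projects by count (not by dollar value, which would be a different story) and that they were rounded to the nearest whole percent. The names are mine.

```python
def achievable_percentages(n):
    """Whole-number percentages that k out of n items can round to, for k = 0..n."""
    return sorted({round(100 * k / n) for k in range(n + 1)})

# Percentages reported for the 1979 sample of 9 projects.
reported = {"A": 47, "B": 29, "C": 19, "D": 3, "E": 2}

possible = achievable_percentages(9)
impossible = {name: pct for name, pct in reported.items() if pct not in possible}

print("achievable with n = 9:", possible)
print("reported but not achievable:", impossible)
```

Under those assumptions none of the five reported figures is a percentage that a count out of 9 can produce.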
Looking closer, I'm not sure this data makes sense. 0.02 * 9 is... 0.18. Not a whole number.
Basically, that's you saying exactly what is making me say "the coincidence is implausible". A sample of 9 will generally not contain an instance of something that comes up 2% of the time. Even more seldom will it contain that and an instance of something that comes up 3% of the time.
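To put rough numbers on "generally not": a quick Python sketch, assuming each of the 9 projects independently lands in D with probability 3% and in E with probability 2% (i.e. treating the reported percentages as true per-project probabilities, by count). The variable names are mine.

```python
n = 9          # projects in the 1979 sample
p_d = 0.03     # assumed per-project probability of category D
p_e = 0.02     # assumed per-project probability of category E

# Chance that at least one of the 9 projects falls in E.
p_at_least_one_e = 1 - (1 - p_e) ** n

# Chance of at least one D AND at least one E, by inclusion-exclusion:
# 1 - P(no D) - P(no E) + P(neither D nor E).
p_both = 1 - (1 - p_d) ** n - (1 - p_e) ** n + (1 - p_d - p_e) ** n

print(f"P(at least one E)     = {p_at_least_one_e:.3f}")
print(f"P(at least one D and E) = {p_both:.3f}")
```

On these assumptions a sample of 9 contains an E about one time in six, and contains both a D and an E only a few percent of the time.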
So, in spite of appearances, it seems as if our respective intuitions agree on something. Which makes me even more curious as to which of us is having a clack and where.
What I'm trying to figure out is, how do I determine whether a source I'm looking at is telling the truth? For example, let's take this page from Metamed: http://www.metamed.com/vital-facts-and-statistics
At first glance, I see some obvious things I ought to consider. It often gives numbers for how many die in hospitals/year, but for my purposes I ought to interpret it in light of how many hospitals are in the US, as well as how many patients are in each hospital. I also notice that as they are trying to promote their site, they probably selected the data that would best serve that purpose.
So where do I go from here? Evaluating each source they reference seems like a waste of time. I do not think it would be wrong to trust that they are not actively lying to me. But how do I move from here to an accurate picture of general doctor competence?