Oh, excellent. I do love data. What is the format (what is the maximum amount of information you have about each individual)?
Given that you already have the data (and you probably have reason to believe individuals weren't trying to game the test?), I suspect the best way is to graph both accuracy and anticipated accuracy against the chosen demographic, and then, for all your readers who want numbers, compute either the ratio or the difference of those two and publish the PMCC of that against the demographic (it's frequentist, but it's also standard practice, and I've had papers rejected that don't follow it...).
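In code, roughly what I mean (a sketch only, assuming a pandas table with hypothetical columns "age" for the demographic, "accuracy" for each person's actual proportion correct, and "anticipated" for their mean stated confidence):

```python
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import pearsonr

def plot_and_correlate(df: pd.DataFrame, demographic: str = "age") -> float:
    # Graph both actual and anticipated accuracy against the demographic.
    plt.scatter(df[demographic], df["accuracy"], label="accuracy")
    plt.scatter(df[demographic], df["anticipated"], label="anticipated accuracy")
    plt.xlabel(demographic)
    plt.ylabel("proportion correct / stated confidence")
    plt.legend()
    plt.show()

    # For the readers who want a single number: take the difference of the two
    # and publish its PMCC (Pearson's r) against the demographic.
    overconfidence = df["anticipated"] - df["accuracy"]
    r, p = pearsonr(df[demographic], overconfidence)
    print(f"PMCC of (anticipated - actual) against {demographic}: r={r:.3f}, p={p:.3g}")
    return r
```

I'd lean toward the difference rather than the ratio, since its sign stays easy to read: positive means overconfident for that group.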
...PMCC...
I'm not sure what the Pacific Mennonite Children's Choir has to do with it... oh wait, nevermind.
There are lots of scoring rules for probability assessments. Log scoring is popular here, and squared error (the Brier score) also works.
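For binary true/false answers, both rules look something like this (a sketch; "confidences" are the stated probabilities that each chosen answer is right, and "correct" records whether it actually was):

```python
import numpy as np

def log_score(confidences: np.ndarray, correct: np.ndarray) -> float:
    # Mean log score: log(p) when the answer was right, log(1 - p) when it
    # wasn't. Higher (closer to zero) is better.
    p = np.where(correct, confidences, 1.0 - confidences)
    return float(np.mean(np.log(p)))

def brier_score(confidences: np.ndarray, correct: np.ndarray) -> float:
    # Mean squared error between stated confidence and the 0/1 outcome.
    # Lower is better.
    return float(np.mean((confidences - correct.astype(float)) ** 2))

# e.g. five true/false questions with stated confidences
conf = np.array([0.9, 0.7, 0.6, 0.8, 0.55])
right = np.array([True, True, False, True, False])
print(log_score(conf, right), brier_score(conf, right))
```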
But if I understand these correctly, they are combined measurements of both domain ability and calibration. For example, if several people took a test where they had to state their confidence in their answers to true-or-false questions about history, then well-calibrated people would have a low squared error, but so would people who simply know a lot of history.
So (I think) someone who always said 70% confidence and got 70% of the questions right would get a better score than someone who always said 60% confidence and got 60% of the questions right, even though they are both equally well calibrated.
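Checking my arithmetic under squared error, with those two hypothetical test-takers (expected_brier is just a throwaway helper):

```python
# Expected squared error per question for someone who always states
# confidence p and is right exactly p of the time:
#   right with probability p      -> error (1 - p)^2
#   wrong with probability 1 - p  -> error p^2
def expected_brier(p: float) -> float:
    return p * (1 - p) ** 2 + (1 - p) * p ** 2   # simplifies to p * (1 - p)

print(expected_brier(0.7))  # 0.21 -- the lower (better) squared error
print(expected_brier(0.6))  # 0.24
```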
The only pure calibration estimates I've ever seen are calibration curves in the form of a set of ordered pairs, or those limited to a specific point on the curve (e.g. "if ey says ey's 90% sure, ey's only right 60% of the time"). There should be a way to take the area under (or over) the curve to get a single value representing total calibration, but I'm not familiar with the method or whether it's been done before. Is there an accepted way to get single-number calibration scores separate from domain knowledge?
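Something like the following is roughly what I have in mind: bin the stated confidences, compare each bin's average confidence to its actual hit rate, and take the count-weighted sum of the squared gaps. (A sketch only; I believe this is close to the "reliability" term in the standard decomposition of the Brier score, which is the nearest thing I know of to a single calibration number separate from domain knowledge, but I'd welcome a pointer to the accepted method.)

```python
import numpy as np

def calibration_score(confidences: np.ndarray, correct: np.ndarray,
                      n_bins: int = 10) -> float:
    # Bin the stated confidences into n_bins equal-width bins.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_idx = np.digitize(confidences, edges[1:-1])  # indices 0 .. n_bins - 1
    total = len(confidences)
    score = 0.0
    for b in range(n_bins):
        in_bin = bin_idx == b
        if not in_bin.any():
            continue
        avg_conf = confidences[in_bin].mean()   # what ey said, on average
        hit_rate = correct[in_bin].mean()       # how often ey was actually right
        # Count-weighted squared gap between the calibration curve and the
        # diagonal -- the "area over/under the curve" idea, in spirit.
        score += (in_bin.sum() / total) * (avg_conf - hit_rate) ** 2
    return float(score)  # 0 = perfectly calibrated; larger = worse calibrated

# e.g. the "says 90%, right 60% of the time" person from above:
conf = np.full(10, 0.9)
right = np.array([True] * 6 + [False] * 4)
print(calibration_score(conf, right))  # (0.9 - 0.6)^2 = 0.09
```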