A problem with trying to isolate calibration is that, on a true/false test, the subject could always assign 50% probability to both true and false and be right 50% of the time, achieving perfect calibration.
Interestingly, this problem can be avoided by taking the domain of possible answers to be the natural numbers, or n-dimensional Euclidean space, etc., over which no uniform distribution is possible, and then asking your test subject to specify a probability distribution over the whole space. This is potentially impractical, though, and I'm not certain it can't be gamed in other ways.
There are lots of scoring rules for probability assessments. Log scoring is popular here, and squared error also works.
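For concreteness, here is a minimal sketch of those two rules for a single true/false question. The function names `log_score` and `squared_error` are made up for illustration, and I'm assuming the subject states a probability `p` strictly between 0 and 1 that the statement is true.

```python
import math

def log_score(p, outcome):
    """Log of the probability assigned to what actually happened.

    Higher is better; the maximum is 0. Assumes 0 < p < 1.
    """
    return math.log(p if outcome else 1.0 - p)

def squared_error(p, outcome):
    """Squared gap between the stated probability and the 0/1 outcome.

    Lower is better; the minimum is 0.
    """
    return (p - (1.0 if outcome else 0.0)) ** 2

# A confident forecast is rewarded when right and punished when wrong.
print(log_score(0.9, True), squared_error(0.9, True))
print(log_score(0.9, False), squared_error(0.9, False))
```

Note the opposite sign conventions: for the log score, higher is better, while for squared error, lower is better.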
But if I understand these correctly, they are combined measurements of both domain-ability and calibration. For example, if several people took a test on which they had to estimate their confidence in their answers to certain true or false questions about history, then well-calibrated people would have a low squared error, but so would people who know a lot about history.
So (I think) someone who always said 70% confidence and got 70% of the questions right would get a better score (a lower squared error) than someone who always said 60% confidence and got 60% of the questions right, even though they are both equally well calibrated.
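A quick check of that arithmetic, as a sketch. I'm assuming the subject states the same confidence in their chosen answer on every question and is right at exactly the stated rate; `expected_squared_error` is a hypothetical helper, not taken from any library.

```python
def expected_squared_error(confidence, accuracy):
    """Average squared error for a subject who always states `confidence`
    in their chosen answer and is right with frequency `accuracy`."""
    # Right answers contribute (1 - confidence)^2, wrong ones confidence^2.
    return accuracy * (1.0 - confidence) ** 2 + (1.0 - accuracy) * confidence ** 2

# Perfectly calibrated subjects at three different skill levels.
for c in (0.5, 0.6, 0.7):
    print(c, round(expected_squared_error(c, c), 4))
```

When confidence equals accuracy the expression simplifies to `c * (1 - c)`, so the three subjects come out at roughly 0.25, 0.24 and 0.21. Under these assumptions, the one who knows more gets the better (lower) score despite identical calibration.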
The only pure calibration estimates I've ever seen are calibration curves in the form of a set of ordered pairs, or those limited to a specific point on the curve (e.g. "if ey says ey's 90% sure, ey's only right 60% of the time"). There should be a way to take the area under (or over) the curve to get a single value representing total calibration, but I'm not familiar with the method or whether it's been done before. Is there an accepted way to get single-number calibration scores separate from domain knowledge?
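To make the "area between the curve and the diagonal" idea concrete, here is one possible sketch; I'm not claiming it is the accepted method. It groups answers by stated confidence and averages the absolute gap between stated confidence and observed hit rate, weighted by how many answers fall in each group. The name `calibration_gap`, the input format, and the choice of absolute rather than squared gap are all assumptions of mine.

```python
from collections import defaultdict

def calibration_gap(forecasts):
    """forecasts: non-empty list of (stated_confidence, was_correct) pairs.

    Returns the weighted average absolute gap between stated confidence
    and observed hit rate. 0.0 means perfectly calibrated on this data.
    """
    groups = defaultdict(list)
    for confidence, correct in forecasts:
        groups[confidence].append(1.0 if correct else 0.0)
    total = len(forecasts)
    return sum(
        len(hits) / total * abs(confidence - sum(hits) / len(hits))
        for confidence, hits in groups.items()
    )

# Always says 70% and is right 7 times in 10.
always_70 = [(0.7, True)] * 7 + [(0.7, False)] * 3
# Always says 60% and is right 6 times in 10.
always_60 = [(0.6, True)] * 6 + [(0.6, False)] * 4
# Says 90% but is right only 6 times in 10.
overconfident = [(0.9, True)] * 6 + [(0.9, False)] * 4

print(calibration_gap(always_70), calibration_gap(always_60))
print(calibration_gap(overconfident))
```

The 70% and 60% subjects both come out at zero, so domain knowledge doesn't enter the number, while the overconfident subject comes out at about 0.3. The always-50% strategy on a true/false test would also score zero here, so this doesn't address that way of gaming the measure.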