The following is slightly edited from a pitch I wrote for a general audience. I've added blog-specific content afterwards.


Information technology allows for unprecedented levels of collaboration and debate. When an issue arises, people communicate freely, are held accountable for misrepresentations, and are celebrated for cogent analysis. We share information and opinion better than ever before. And then we leave the actual decision up to one person, or a tiny committee, or a poll of a population that for the most part wasn't paying attention, or at best an unpredictably irrational market. The one thing we still don't aggregate in a sophisticated way is human judgment.

Organizations evolve complex decision-making structures because variance in human judgment is complicated. We try to put the most competent person in charge—but there is wisdom in crowds, and so a wise leader gets buy-in from a broad pool of competent subordinates. We must constantly try to evaluate who has the best record, to see who's been right in the past... and we get it wrong all the time. We overestimate our own competence. In hindsight, we misremember the right decision as being obvious. We trust the man with the better hair. Any organization with group buy-in on decisions amasses a solid amount of data on the competence of its members, but it does not curate or use this data effectively.

We can do better, using current technology, some simple software, and some relatively simple math. The solution is called histocracy. It is most easily explained with a use case.

The H Foundation is a hypothetical philanthropic organization, with a board of twelve people overseeing a large fund. Each year, they receive and review several hundred grant applications, and choose a few applicants to give money to. Sometimes these applicants use the money effectively, and sometimes they fail. Often an applicant they turn down will get funding elsewhere and experience notable success or failure. In short, it is often obvious to the board in hindsight whether they made the right decision.

For each application, the yea or nay of each board member is recorded. If and when, later, the board reaches a consensus on whether that application should have been approved, this consensus is recorded as well. The result is that each board member accumulates a score. Alice's votes have been right 331 times and wrong 59 times, while Bob's votes have been right 213 times and wrong 110 times (they weren't always present for the same votes). Already from this raw data we can see that Alice's opinion should count for more than Bob's.

With a computer's ease with arithmetic, we can quantify this. Some math is given in an appendix; here it suffices to say that it would be reasonable to give Alice's vote a little over 7/4 the weight of Bob's: if the board is to maximize its chance of making the correct choice, 4 Alices should be able to outvote 7 Bobs. The board members each connect to a shared server and vote on each application; the software performs the relevant calculations and determines the victor.
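
A minimal sketch of that weighting, assuming (as the Log(Laplace(Record)) formula mentioned later suggests) that each voter is scored by the negative log of a Laplace-smoothed error rate; the function names are illustrative:

    import math

    def vote_weight(right, wrong):
        # Laplace-smoothed error rate: (wrong + 1) / (right + wrong + 2).
        error_rate = (wrong + 1) / (right + wrong + 2)
        # Weight is the negative log of that rate: rarely-wrong voters
        # get large weights, coin-flippers get weights near log 2.
        return -math.log(error_rate)

    def decide(ballots):
        # ballots: list of ((right, wrong), vote) pairs, vote in {"yea", "nay"}.
        totals = {"yea": 0.0, "nay": 0.0}
        for record, vote in ballots:
            totals[vote] += vote_weight(*record)
        return max(totals, key=totals.get)

    print(vote_weight(331, 59) / vote_weight(213, 110))  # ~1.75, a bit over 7/4

Four Alices then carry about 4 × 1.88 ≈ 7.5 weight against seven Bobs' 7 × 1.07 ≈ 7.5, the near-tie described above.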

In this system, the board members perform the massively complex task of evaluating the applicants, a job requiring expert judgment and intuition, while the computer dispassionately and precisely evaluates the board. The result is a system wiser than any individual board member.

When scaling this solution up to a large business with thousands of employees, the math stays the same while the interface changes. Decisions need to be shared and discussed on a corporate intranet, and tagged by type so that employees can find and vote on only those decisions they feel competent to vote on. Employees who try to make decisions on matters beyond their competence will fail to accumulate enough voting weight to skew the decision; this means that decisions in all areas can be opened to the entire workforce. Managers should be encouraged to reframe decisions they are pondering as corporation-wide referenda. Evaluating a decision in hindsight should in this case be reserved to the owners and shareholders, or to a system or charter they have approved.
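
One possible shape for the records in such a system (a sketch only; no schema is specified here): keeping a separate record per tag is one way to stop competence in one field from inflating votes in another.

    import math
    from collections import defaultdict

    class Employee:
        def __init__(self):
            # A separate (right, wrong) record for each decision tag.
            self.record = defaultdict(lambda: [0, 0])

        def weight(self, tag):
            right, wrong = self.record[tag]
            error_rate = (wrong + 1) / (right + wrong + 2)
            return -math.log(error_rate)

        def settle(self, tag, was_correct):
            # Called when the owners' hindsight verdict comes in.
            self.record[tag][0 if was_correct else 1] += 1

Under this scheme a voter starts near the coin-flip weight on every tag, so opining outside one's competence carries little force.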

Expanding the scale even further, the same approach could be applied to advice, solicited or unsolicited. Consider a site to which clients could pay to submit polls on decisions that concerned them. The polls would be conducted and reported histocratically. The client would later be asked to report whether the advice given by the community turned out to be correct. Prizes and recognition could be given to those solvers who accumulate the highest voting weights, thereby incentivizing participation and excellence. For unsolicited advice, a similar approach could be used with petitions.

In summary, we note that human judgment is essentially a set of predictions, and thus can be judged empirically and aggregated mathematically. Group decision-making is such an omnipresent and consequential task that optimizing it may be the single most important thing we can do. Let's do it rigorously, and let's do it now.


 

On Sunday, I posted a call for solutions in advance of this post. Which is a weird thing to do, but I have a terror of Irrevocable Actions, and I can't untell you something. (Coincidentally, at the same time as people were chiding me for this, a discussion started about my also mildly eccentric decision to put my play behind a semipermeable paymembrane, which has a similar explanation; it's easier to make something free that was once non-free than the reverse, and in many circles charging for something is actually higher-status.)

I didn't mention prediction markets because I didn't want people to anchor on it—it's just a hop, skip, and a jump from futarchy to histocracy, so that would obviate the point. As expected, people went there immediately anyway, and from there to something very close to my idea. Much of the discussion centered around the difficulty of creating a well-defined charter. While I certainly agree that a quantifiable group utility function is usually difficult, if you go up a level of meta you'll see that well-defined charters are everywhere: a decision is correct if and only if the people in power judge it to have been correct. To be a democracy, we don't need to explicitly vote on values—we just need to let people vote on consequences in accordance with their values. The king's order may be ambiguously worded, but your true duty is clear: please the king.

There are some clear advantages to histocracy over futarchy: most relevantly, I believe histocracy will work well on a small scale, while prediction markets require a large crowd. Given enough time and participation, histocracy will inevitably beat a market. There's less moral hazard, and less vulnerability to manipulation.

Futarchy beats histocracy in that there's a built-in incentive to participate and excel; but people vote in elections and serve on non-profit boards for free, so I don't see a huge need to inject cash. Futarchy allows individual actors to express degrees of confidence in a way that my model of histocracy doesn't, but this could be remedied where feasible. And Hanson's ideas for how to judge consequences in hindsight might be appropriate for some histocracies.

The potential pitfalls of histocracy depend on the specific implementation. I see politics, in the blue vs. green mind-killing sense, and the difficulty of evaluating consequences even in hindsight as the two major Achilles' heels; but as far as I can see these are universal. There is a danger of a subgroup amassing a large voting weight, then abusing it in the window before they are removed from power, which can perhaps best be guarded against with some sort of constitutional system, perhaps even one formally incorporated into the system as a high Bayesian prior against certain classes of actions being correct.

I should also concede up front that my “mathematical” appendix glosses over the serious AI challenge of doing this right: hopefully, the computing power available to a histocracy will grow much faster than the number of voters. Log(Laplace(Record)) will double-count terribly in large groups, but it does have the advantage of being simple and transparent—entrusting your government to a black box is scary.

Groups giving histocracy a try should start by making it nonbinding. Only when it's working better than your current system should it be adopted. Unless, of course, your current system is a majority vote, in which case you might as well start using it right away.

 


I like the idea of measurement. The problem, though, is that you get what you measure, not what you wanted to measure.

Suppose Charlie is risk-averse, and only approves projects with a 95% chance of meeting expectations. David is risk-neutral, and will approve speculative projects whose expected value is significantly higher than that of the other available projects. Oftentimes these projects have only about a 10% chance of succeeding, and so will exceed expectations only about 10% of the time.

Eventually, Charlie will be recorded as right about nine times as often as David. If David votes against Charlie's projects as too bland and too low in EV, this will go even worse for David, as eventually only Charlie's projects will be approved and David will be recorded as wrong on nearly all of them.

Decision-making is not a logistic regression problem, and so I am pessimistic about logistic regression approaches applied to it. I agree that measuring decision-making ability is a very important task, but approaches like Market-Based Management seem far more promising.

[anonymous]

Also known as Goodhart's law.

If the organization is risk-averse, it doesn't want risk-neutral voters to gain influence. If it's risk-neutral, then it should incorporate opportunity costs when judging projects in hindsight. Furthermore, if in hindsight a rejected project still appears to have had a high positive EV, the org should register the rejection of the project as a mistake.

[anonymous]

Suppose the organisation is risk-neutral, and Charlie abstains from the sub-95% chance projects rather than rejecting them (in a large organisation that makes many decisions you can't expect everyone to vote on everything). He also rejects the sub-5% projects.

By selectively only telling you what you already knew, Charlie builds up a reputation of being a good predictor, as opposed to David, who is far more often wrong but who is giving actual useful input.

Furthermore, if in hindsight a rejected project still appears to have had a high positive EV, the org should register the rejection of the project as a mistake.

This misses the heart of that criticism: mistakes have different magnitudes.

I might as well make this a policy: I will downvote any post that has non-standard text formatting (sizes, fonts, line spacing, etc.).

[anonymous]

That's a little harsh. I'm sure there's a reasonably good idea somewhere in that typographical train-wreck.

I still read it, and will remove the downvote if the typography is fixed. But I consider that error as much an impediment to reading as glaring spelling or grammatical errors, which are worth downvoting.

[anonymous]

That's less harsh. :)

Your tease excited me since I recently started grappling with this issue. Unfortunately, I'm underwhelmed. If the group deals only with binary decisions, participants have a single underlying competency, participation can be suitably restricted, participants don't have strong biases, decisions can be reliably assessed as right or wrong, etc., then you have an elegant solution.

There are some clear advantages to histocracy over futarchy: most relevantly, I believe histocracy will work well on a small scale, while prediction markets require a large crowd. Given enough time and participation, histocracy will inevitably beat a market. There's less moral hazard, and less vulnerability to manipulation.

These claims seem completely unfounded. Prediction markets don't require a crowd. If implemented through a market-maker, you can get by with a single participant. PMs have issues, especially when used to make decisions, but this proposal is rife with manipulation opportunities -- accumulating competency and "spending" it to sway decisions, manipulating whether a decision is counted as a success or judged at all, altering the order of decisions to accumulate competency or harm that of others, collusion to build competency of at least one individual (worthwhile since the weights are convex). The worst manipulation of prediction markets I'm aware of that wouldn't also apply to this is for traders to mislead others for later profit, which wouldn't affect the final probabilities used for decisions.

Besides PMs, there is work being done in Bayesian truth serum, peer prediction, and other collective revelation mechanisms that don't require verification of results for scoring, but still result in truthful answers.

For each application, the yea or nay of each board member is recorded. If and when, later, the board reaches a consensus on whether that application should have been approved, this consensus is recorded as well. The result is that each board member accumulates a score.

Wait a minute - how is consensus reached? Say I voted nay on an application, so it was rejected, because a (weighted) majority voted nay too.

1) How are we supposed to figure out after the fact whether we should have voted "aye" on the application? (Barring easy cases like "the exact same application was then accepted by another foundation and turned into a huge success".)

2) If the consensus turned out that "aye" was the correct answer, I would in effect lose power, right? So it would be in my interest that the consensus does not turn out to be "aye", so it's in my interest to vote "nay" during the consensus, or delay it, etc. - and the same goes for the majority who voted "nay" in the first place, which makes a consensus that the majority was wrong unlikely in many ambiguous cases.

Yeah, it might be necessary in some cases to randomly select a jury whose weights aren't affected by their decision.

Yes, it's a very good idea, although also obvious enough that I wouldn't pin your ego on claiming it as your own. I don't know how far back the idea's history goes, but I've discussed it before. People usually dismiss it as impractical for political reasons - politicians will never vote to implement a system for tracking the performance of politicians, because they know they aren't expert decision-makers. (I suspect many of them would say that the important point is not to predict what will happen, but to have the moral compass to say what should happen.)

IMHO, getting our governments to move in this direction is of an importance on a par with thinking how to avoid making AIs that go berserk. Our existing forms of government are not competent enough to handle the power they will soon have. Evaluating track records is the only way I know to improve the quality of decision-making.

Comments on the math:

Don't assume that each agent is correct more than half the time. As I showed in my simulation, things are easily tractable when agents are correct more than half the time. Agents may be right more than half the time on problems in their life in general, but the problems that are controversial enough to come up in Congress are ones where overall we have closer to random performance.
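
(A toy reconstruction of that kind of simulation, not the original, which wasn't posted; it assumes the smoothed log-error weighting discussed above:)

    import math, random

    def run(competences, n_decisions=1000):
        records = [[0, 0] for _ in competences]   # [right, wrong] per agent
        group_right = 0
        for _ in range(n_decisions):
            votes = [random.random() < c for c in competences]  # True = correct side
            weights = [-math.log((w + 1) / (r + w + 2)) for r, w in records]
            yea = sum(wt for wt, v in zip(weights, votes) if v)
            nay = sum(wt for wt, v in zip(weights, votes) if not v)
            group_right += yea > nay
            for rec, v in zip(records, votes):
                rec[0 if v else 1] += 1
        return group_right / n_decisions

    print(run([0.9] * 5 + [0.55] * 20))  # a competent few dominate: high accuracy
    print(run([0.52] * 25))              # near-random voters: barely above 0.5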

The scheme you have proposed results in choosing the worst of two options when the correct choice is obvious. That's because if you have 100 voters, and 99 of them choose option A while one idiot who is right half the time chooses option B, the values of (1 - c_i) for the 99 voters, multiplied together, will be less than 1/2.

Upvoted, because this is important and you're trying to formalize it.

Neat, I think this is the second time you've scooped me like that. As I mentioned in the other post, I don't have an exact timestamp for when I first came up with this or put it on the web, but as you say it's obvious enough that someone probably beat me to it. We, and it turns out the Black Belt Bayesian, all got there independently after all.

The math is right; note that it's a "<" sign in the multiplicative equation, and everything in the additive equation is multiplied by -1. Given the assumption of independence, it really is that simple.

Oops, the math is right. You are going to have problems in the other direction due to the falseness of the independence assumption - in a two-party system, whichever party is larger will generally win. But that happens today anyway.
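
(A quick numeric check of the 99-versus-1 example, assuming the multiplicative rule picks the side whose product of (1 - c_i) terms, i.e. its chance of being collectively wrong under independence, is smaller:)

    # 99 voters, each right 90% of the time, pick A;
    # one coin-flipper (right half the time) picks B.
    p_A_side_wrong = (1 - 0.9) ** 99   # vanishingly small
    p_B_side_wrong = 1 - 0.5
    print(p_A_side_wrong < p_B_side_wrong)   # True: the rule picks A, as it should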

I think you could do better if you look at correlations.

For example, if Alice and Bob always give the same vote, then their combined vote should count exactly the same as the vote of just one of them would if only one were on the board.

Another possible improvement: Instead of voting yea or nay, you give a probability. If you give a higher probability, your vote counts more, and it helps you more if you're right, but it hurts you more if you're wrong.
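
(One way to cash that out, as a sketch only: pool the probability votes in log-odds space, weighted by track record, and score each voter with the log rule so that a confident miss hurts far more than a timid one.)

    import math

    def pooled_probability(prob_votes, weights):
        # Weighted average of log-odds, mapped back to a probability.
        z = sum(w * math.log(p / (1 - p)) for p, w in zip(prob_votes, weights))
        z /= sum(weights)
        return 1 / (1 + math.exp(-z))

    def log_score(p, happened):
        # log(p) if the event happened, log(1 - p) if not.
        return math.log(p if happened else 1 - p)

    print(pooled_probability([0.9, 0.6], [2.0, 1.0]))   # ~0.83
    print(log_score(0.9, True), log_score(0.9, False))  # -0.11 vs -2.30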

This post sounds to me like the author almost invented prediction markets or other well-known expert-fusion information-aggregation techniques, but then fell into "not invented here" syndrome, rationalizing and hypothesizing unproven advantages to the tiny differences between histocracy and the other similar ideas.

Possibly this points out a flaw in our allocation of prestige - in order to correctly incent adoption, we need to spread prestige from the first or most well-known advocate of an idea to the early adopters. See Derek Sivers's idea of the First Follower: http://sivers.org/ff

or other well-known expert-fusion information-aggregation techniques

Do you have some examples in mind?

I am not an expert, I just think it's a big field, with a lot of work done in it. There's an entire journal called "Information Fusion" for example (though I don't think highly of Dempster-Shafer stuff).

The best exemplar I can point to is Freund and Schapire's "Adaptive game playing using multiplicative weights": http://cseweb.ucsd.edu/~yfreund/papers/games_long.pdf

Another possibly-decent review is Blum and Mansour's chapter on regret minimization: www.cs.cmu.edu/~avrim/Papers/regret-chapter.pdf

Thank you, I was curious!

Not my particular sin, since I came up with this before hearing of prediction markets.

Downvoted for messed-up formatting.

...and upvoted for the quick fix ! Thanks !

Often an applicant they turn down will get funding elsewhere and experience notable success or failure.

The problem is, this generally isn't the case. If a proposal passes, then your standing on the board is guaranteed to change, whereas if a proposal fails, then standings will only be adjusted if some other group decides to implement a similar measure. Humans are notoriously risk-averse, and we tend to fear losing things that we already have more than we value gaining new things. Consequently, I'd be concerned that such a setup might incentivise voting down unorthodox proposals that other organizations are unlikely to implement. If such a proposal passes, then you will definitely either gain or lose standing, whereas if it's rejected, your standing will likely remain unchanged.

There are some clear advantages to histocracy over futarchy: most relevantly, I believe histocracy will work well on a small scale, while prediction markets require a large crowd. Given enough time and participation, histocracy will inevitably beat a market.

You have this backwards. Given enough time and participation the market will win. It was the small scale where your histocracy had a chance. Given enough time and participation someone will seek to gain money by incorporating whatever the histocracy knows into their trades.

There's less moral hazard, and less vulnerability to manipulation.

... in the prediction market. In a sufficiently open market the more someone tries to manipulate it the more reliable it gets.

Given enough time and participation the market will win.

Given enough time and participation the market will tend towards something like the arithmetic mean of the estimates of all of its participants. Which is great, Wisdom of Crowds and all that, and I'd even call the whole concept beautiful given the way it practically runs itself, but it's not even a local optimum among aggregation algorithms--you'd get a few points better calibrated if you threw out the bets made by Idiot Jed, Glutton for Punishment.

you'd get a few points better calibrated if you threw out the bets made by Idiot Jed, Glutton for Punishment.

You mentioned elsewhere that you haven't 'read up on your Hanson' regarding prediction markets. You need to. The above just isn't how prediction markets work. Moreover, if you added a bunch more Idiot Jeds you may end up with a market that is better calibrated.

Once again, I suggest you focus your attention on areas where for some reason a remotely efficient market just isn't feasible. This usually means either something where there is not enough money to seed a decent market (and not enough Idiot Jeds to steal from) or somewhere an open market is not possible due to privacy considerations. This still leaves you with a rather large area for finding potential applications.

if you added a bunch more Idiot Jeds you may end up with a market that is better calibrated.

If smart money keeps coming in, then over time, Idiot Jed's opinions get discounted more and more, asymptotically (with exceptions like the one I mention in the other thread, where it's actually not smart to bet against him even though he's wrong). But if you're looking at one time-slice of a market-state, and you know that Idiot Jed is buying, you should always adjust in the opposite direction.

But if you're looking at one time-slice of a market-state, and you know that Idiot Jed is buying, you should always adjust in the opposite direction.

This is false. What makes you think you are better at accounting for Jed's idiocy than the other people in the market are? You need to abandon your equivocation between markets and an arithmetic mean of participant estimates. It really is more complicated than that.

At time t, the market hasn't adjusted yet. The other people in the market are noticing that the contract is overvalued because of Jed, so they're preparing to short it, which is how the market will adjust. Meanwhile, I'm noticing the same thing, so I'm making a prediction that's better than the market's current prediction.

This makes sense as long as you understand it's easy money.

Have you tried modeling your approach numerically and comparing the results with alternatives?

[anonymous]

The line spacing in this article makes it hard to read.

[This comment is no longer endorsed by its author]

Is histocracy compatible with a secret ballot? (And for that matter, is futarchy?)

And as a separate question, would it be a good idea to keep voters' individual reliability scores secret, too? If a voter is known to have an accurate record and her opinion is public before a vote, couldn't she get overweighted, because she'll sway others' votes as well as getting more weight in the vote sum?

And as a separate question, would it be a good idea to keep voters' individual reliability scores secret, too? If a voter is known to have an accurate record and her opinion is public before a vote, couldn't she get overweighted, because she'll sway others' votes as well as getting more weight in the vote sum?

Would also be a target for focused lobbying/filtered argumentation.

This is a step in the right direction. Now take more steps; they should be trivial given momentum.

Examples of things to improve:

  • Voting on values and beliefs separately.
  • Voting on things that are not multiple-choice, such as a point along a continuum or a sorting order.
  • Having voters give a probability for each choice rather than a single best option.
  • Voting on meta things like what type of problem it is, and having the votes of those who have been better at that specific type of problem count more.

This system doesn't seem to have a way of taking into account varying levels of skill in different fields. For instance, if someone is an expert in a particular field, is right about questions in that field a very large percentage of the time, and most of the time votes only on questions in that field, that person's votes will have a very large weight on all questions, even if he is only average in subjects other than his own field. In this system, his votes will have a very high weight, as he is almost always right on the questions he votes on, since he nearly always votes on the ones he knows a lot about. Then he goes and votes on some things he knows nothing about, yet he still gets high weights on his votes. If he votes on many decisions in his field for each one he votes on in fields he knows nothing about, he will keep his high vote weight, and so will keep influencing some decisions (that is, the ones not in his field) far more than he "should".

This person is hurting his overall voting weight as compared to those who stay within their areas of expertise, so he will be outvoted in his own field by other experts. And one person being a bit more weighty shouldn't amount to much in a large pool. The more realistic version of that scenario was actually addressed in OP:

There is a danger of a subgroup amassing a large voting weight, then abusing it in the window before they are removed from power, which can perhaps best be guarded against with some sort of constitutional system, perhaps even one formally incorporated into the system as a high Bayesian prior against certain classes of actions being correct.

I'm not entirely clear on how this improves on a prediction market with, say, fixed membership and no additional buy-ins.

Markets are a way of doing this, but they're optimized for ease-of-setting-up, not predictive power. There's no particular reason to expect them to do better than a good algorithm. They have well-documented irrationalities. I've seen Intrade go through tulipmania, and I've personally bought Sarah-Palin-To-Withdraw-As-VP-Candidate stock I thought was overvalued because I knew it was about to rise even further. Better to have no incentives to do anything other than be right and influence policy.

I've personally bought Sarah-Palin-To-Withdraw-As-VP-Candidate stock I thought was overvalued because I knew it was about to rise even further.

Funny thing, I'll bet somebody else did the exact same thing just before the price crashed. The fact that you made money in that instance doesn't prove that it was a sane decision.

(I do agree that Intrade suffers from market failures of various kinds, but many of them are caused by high barriers to liquidity -- like the fact that you couldn't make all that much money by shorting a fringe candidate's chances, even if that candidate's supporters were irrationally buying a 1% chance up to 5%.)

But that sort of thing is sometimes going to be a sane decision, if you're sufficiently confident you understand the market (in this case, I wasn't trading on momentum, it was just that she made a major gaffe and the market hadn't responded yet). In an ideal system, misrepresenting your beliefs should never be rewarded with increased voting power. It's not an anecdote, it's an existence proof.

I'm being kind of glib because I haven't really read my Hanson on this.

Markets are a way of doing this, but they're optimized for ease-of-setting-up, not predictive power. There's no particular reason to expect them to do better than a good algorithm.

Awesome. Please provide me with the good algorithm and point me in the direction of a market that is inferior to said algorithm.

(Alternately: "People like money" is a good reason.)

The problem is that I have no dataset on which to test an algorithm. Even if I could get access to the trade history of a large prediction market, sorted by trader, it would still be problematic to convert trades directly to predictions. If I buy Palin-Withdraw at 5% and sell it at 7%, is it because I think her chance of withdrawing is 6% or because I'm playing the market? PredictionBook.com might work--my guess is there's not enough users and not enough calibration data yet to beat Intrade, but I could be wrong.

That's why I've worded the article more like advocacy than like an academic paper. People need to try it before we can test it. If you don't think people should try it, you need to object on hypothetical grounds, not demand unobtainable evidence.

(People like money more than they like making good predictions. They play the market, they're risk averse, and they discount value over time. The latter two demonstrably lead to systemic calibration errors in real prediction markets, according to a paper I just stumbled across while looking for data.)

If you don't think people should try it, you need to object on hypothetical grounds, not demand unobtainable evidence.

I'm demanding a free money generator. The point here isn't that the evidence is unobtainable but rather that the very nature of markets dictates that if such an algorithm can be obtained then any sufficiently large market will quickly be exploited by someone with the algorithm and so will learn from it. This just means that you need to focus your attention on applications that for some reason cannot be large enough or open enough to operate remotely efficiently.

I understand the principle (and your advice is obviously sound and well-taken), although I'm still feeling adversarial enough to note that the inverse also holds. If you're using a market to be the best voter in a histocracy, the histocracy learns from the market.

And there's no free money to be had when the market's being systemically irrational due to individual traders being rational. I think this contract is way too high, but I'm not going to risk (and render illiquid) a few thousand dollars now to win a few hundred dollars in November, and that's why the contract is overvalued. There may be no investor anywhere who's willing to pay $1000 in January in exchange for a 99% chance of $1030 in November, which would mean that no matter how big Intrade gets, longshot contracts will remain overvalued. This isn't a counterexample to general efficient market theorems; it's just that the economically correct price for the contract is not equal to its expected value.

it's just that the economically correct price for the contract is not equal to its expected value.

This is true. Translation is required.

I think it's worse than that. Suppose we're in a futarchy and there's a proposal to build an innovative new kind of nuclear reactor in the heart of Cardiff. According to our state-determined utility function, this has positive utility if and only if the risk of meltdown within 10 years is less than .00001. But regardless of what people actually think, the price of CardiffReactor.Meltdown.Before2022 has a hard floor: the price where it's not worth it to anyone on the planet to short-sell, because there are better investment opportunities elsewhere. If this floor is greater than .00001 (and how could it not be?), the market provides no usable information.
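
(Back-of-the-envelope arithmetic for that floor, assuming a short seller must post full collateral against the $1 payout for the contract's ten-year life; a simplification, but the right order of magnitude:)

    price = 0.001        # market's implied meltdown probability
    true_prob = 0.00001  # suppose this is the actual risk
    collateral = 1 - price             # locked up per contract shorted
    edge = price - true_prob           # expected profit per contract
    annual_return = (1 + edge / collateral) ** (1 / 10) - 1
    print(annual_return)               # ~1e-4, i.e. ~0.01%/year: nobody shorts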

The difficulty here is one of constructing markets and, critically, derivatives that make trading based on that sort of information feasible. This is the sort of thing that would require infrastructure in place to allow large amounts of complicated trading. If we were actually in a serious futarchy this kind of thing could and probably would be traded on, albeit indirectly in a huge sea of conditional probability payoff systems.

For us, given that we are not in a mature futarchy (and do not otherwise have access to an advanced and heavily traded prediction market), we are of course unable to use a market to directly answer that kind of question.

I think it would have been better if you didn't have the first post, which sounded like you thought you'd found the Best System Ever and not just an improvement on the status quo (albeit one which might be simpler to implement than all-out futarchy). I foresee it being difficult to convince organizations to do this, though; the boards won't want to put their high status on the line like that.

With a restricted domain and certain assumptions, I do think I've found the Best System Ever. The first post was because I'm not confident.

Well, then I think you haven't put enough thought into how the system might be gamed (as it would be in practice). With your initial naive version, there would be an incentive to weigh in only on decisions that are slam-dunks and on decisions that you personally have a stake in, using the first to "buy" credibility that you "spend" on the other. Because of this, difficult decisions would be dominated by people with ulterior motives.

Now, of course there can be fixes for this, but it serves to illustrate that your system probably won't be perfect fresh out of the box. Again, I think it would be an improvement on a system that doesn't even track people's records, but I don't share your total zeal.

Also, people's decision-making abilities change over time. What I did right or wrong 5 years ago is not as important as what I did right or wrong one week ago. So, the influence of past decision scores should diminish as time passes. Exponential decay of importance, maybe? Another way you could do it is to use a rating number for each player analogous to those used, e.g., in chess (Elo).
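
(A sketch of the decay idea: multiply the stored record by a constant factor each period before adding new results, so old calls fade with a chosen half-life. The half-life is a free parameter, and an Elo-style update rule would be another reasonable choice.)

    import math

    class DecayingRecord:
        def __init__(self, half_life=52):   # measured in periods, e.g. weeks
            self.decay = 0.5 ** (1.0 / half_life)
            self.right = 0.0
            self.wrong = 0.0

        def tick(self):
            # Call once per period; a result from one half-life ago
            # counts half as much as one from today.
            self.right *= self.decay
            self.wrong *= self.decay

        def settle(self, was_correct):
            if was_correct:
                self.right += 1.0
            else:
                self.wrong += 1.0

        def weight(self):
            error_rate = (self.wrong + 1) / (self.right + self.wrong + 2)
            return -math.log(error_rate)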

After thinking a bit about your proposed scheme, I see three non-negligible drawbacks, which don't make it useless, but which (at least in my opinion) significantly reduce the cases in which it can be safely used (at least, without modifications).

The first issue is the one I already spoke about (regarding prediction markets) in the teaser: your scheme will work well if (as in the case of the charity board deciding which projects to finance) the ones taking the decisions don't have much involvement in the project later on. But if you try to apply it to situations like a group of engineers deciding which technical solution to use for their own project, the incentive effect will be very dangerous: if I voted against the solution that was finally chosen, my interest is now to ensure the project will fail, so that I will have been right.

The second issue is easier to explain with an example. Imagine you have 10 people on the board of a charity that has to approve or refuse projects. One of the 10, person O, is very optimistic. Of the latest 100 proposals, he approved 80 and disapproved only 20. Of the 80 he approved, 35 were later judged to be bad projects and 45 good ones. Of the 20 he refused, only one was in fact a good project (and the other 19 bad ones). Another member, person N, is normally optimistic. He approved 50 projects and disapproved 50. Of the 50 he approved, 10 were bad and 40 were good. Of the 50 he disapproved, 40 were bad and 10 were good (they didn't always vote on the same proposals, so the numbers don't have to match exactly). So if you compute the ratios, O was right only 56% of the time when he approved, but he was right 95% of the time when he disapproved. N was right 80% of the time in both cases. With such a record, I would give O's vote more weight when he opposes a project, and less when he approves one.
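
(Checking those ratios, and sketching the natural fix: keep two records per voter, one for approvals and one for rejections, and weight each direction separately.)

    # (right, wrong) counts, split by the direction of the vote.
    records = {
        ("O", "approve"): (45, 35),   # 45 good calls among his 80 approvals
        ("O", "reject"):  (19, 1),
        ("N", "approve"): (40, 10),
        ("N", "reject"):  (40, 10),
    }
    for key, (right, wrong) in records.items():
        print(key, right / (right + wrong))
    # O approving: 0.5625; O rejecting: 0.95; N: 0.80 either way.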

The third issue is about risk-taking. Consider the board of directors of a (for-profit or not) research agency, who have to approve funding for research projects. Two projects arrive on the table, both requiring 100 units of financing. One is a low-risk, low-gain project A, which is 90% likely to succeed and will lead to 150 units of gain if it succeeds. The other is a high-risk, high-gain project B, which is only 10% likely to succeed but will lead to 2,000 units of gain if it succeeds. In expectation, project A is worth 150 × 0.9 − 100 = 35, while project B is worth 2,000 × 0.1 − 100 = 100. There are cases in which it's better to choose project A - but most of the time, it would be better to choose B. But if you choose B, you're very likely to be found to have been wrong in hindsight. So with a scheme like the one you propose, decision-makers would favor project A over project B, even though A's expected net gain is only about a third of B's.

The last two issues are a bit of the same kind: your scheme is interesting, but it seems too "binary" (you were right or you were wrong, and we average how often you were right or wrong into your global credibility), and therefore doesn't cope well with some of the complexity of decision-making (optimism vs. pessimism, low-risk low-gain vs. high-risk high-gain, or motivational/incentive issues). But if you are in a case in which those issues don't matter much, then it sounds very promising.

Making the system non-binding wouldn't be that good a test, as there would be much less incentive to bias the decisions to your own personal benefit. There are a couple of compromises with majority vote that could be used to prevent a small group doing this. One, put an upper and lower limit on the value of each vote, so that no vote is worth more than (for example) ten times any other vote. Two, allow people voting against a proposal to mark their votes as vetoes, with some fraction of vetoes, maybe 66%, being enough to veto the proposal. 2B, allow the same thing to be done in favour of the proposal (this may be less important than preventing bad proposals from passing).

You could try using a linear regression to predict the correct answer from votes. That would still be fairly simple and transparent. The usual linear regression tries to minimise the squared error of the prediction, but that should probably be tweaked when predicting a binary answer. Dunno if you can get it to make sense from a Bayesian viewpoint.

"Maximize the log probability given to the correct outcome" is possibly the Bayesian way. Instead of a line, just choose some curve.

Line-spacing fixed, thanks.

(I am slightly bemused that single-spaced Helvetica got such a strong negative reaction. I like single-spaced Helvetica.)

The lesson is: don't mess with the defaults unless you want to distract from your message.