After thinking a bit about your proposed scheme, I see three non-negligible drawbacks. They don't make it useless, but (at least in my opinion) they significantly reduce the cases in which it can be safely used, at least without modifications.
The first issue is the one I already raised in the teaser (there, about prediction markets): your scheme will work well if, as in the case of the charity board deciding which project to finance, those making the decisions don't have much involvement in the project later on. But if you try to apply it to situations like a group of engineers deciding which technical solution to use for their own project, the incentive effect will be very dangerous: if I voted against the solution that was finally chosen, it is now in my interest to make sure the project fails, so that I turn out to have been right.
The second issue is easier to explain with an example. Imagine a charity board of 10 people that has to approve or refuse projects. One of the 10, person O, is very optimistic. Of the latest 100 proposals, he approved 80 and disapproved only 20. Of the 80 he approved, 35 were later judged to be bad projects and 45 good ones. Of the 20 he refused, only one was in fact a good project (and the other 19 bad ones). Another, person N, is normally optimistic. He approved 50 projects and disapproved 50. Of the 50 he approved, 10 were bad and 40 were good. Of the 50 he disapproved, 40 were bad and 10 were good (they didn't always vote on the same proposals, so the numbers don't have to match exactly). So if you compute the ratios, O was right only about 56% of the time when he approved (45 out of 80), but he was right 95% of the time when he disapproved. N was right 80% of the time in both cases. Well, with such a record, I would give O's vote more weight when he opposes a project, and less when he approves one.
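To make the arithmetic in this example easy to check, here is a minimal Python sketch that splits each voter's record by the direction of the vote. The `Record` class and its field names are hypothetical, introduced only for illustration; the tallies are the ones given above for O and N.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """Hindsight tallies for one voter, split by the direction of the vote."""
    approved_good: int   # approved, later judged good  (voter was right)
    approved_bad: int    # approved, later judged bad   (voter was wrong)
    rejected_good: int   # rejected, later judged good  (voter was wrong)
    rejected_bad: int    # rejected, later judged bad   (voter was right)

    def accuracy_when_approving(self) -> float:
        return self.approved_good / (self.approved_good + self.approved_bad)

    def accuracy_when_rejecting(self) -> float:
        return self.rejected_bad / (self.rejected_good + self.rejected_bad)

    def overall_accuracy(self) -> float:
        right = self.approved_good + self.rejected_bad
        total = right + self.approved_bad + self.rejected_good
        return right / total


# Tallies taken from the example in the text.
person_o = Record(approved_good=45, approved_bad=35, rejected_good=1, rejected_bad=19)
person_n = Record(approved_good=40, approved_bad=10, rejected_good=10, rejected_bad=40)

for name, rec in (("O", person_o), ("N", person_n)):
    print(
        f"{name}: right {rec.accuracy_when_approving():.1%} when approving, "
        f"{rec.accuracy_when_rejecting():.1%} when rejecting, "
        f"{rec.overall_accuracy():.1%} overall"
    )
```

A single overall score (64% for O) hides the gap between his approvals (45 of 80) and his refusals (19 of 20), which is the asymmetry the example is pointing at.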
The third issue is about risk-taking. Consider the board of directors of a research agency (for-profit or not) that has to approve funding for research projects. Two projects arrive on the table, and both require 100 units of financing. One is a low-risk, low-gain project A, which is 90% likely to succeed and will lead to 150 units of gain if it succeeds. The other is a high-risk, high-gain project B, which is only 10% likely to succeed but will lead to 2,000 units of gain if it succeeds. In expectation, project A is worth 150 × 0.9 - 100 = 35, and project B is worth 2,000 × 0.1 - 100 = 100. There are cases in which it's better to choose project A, but most of the time it would be better to choose B. But if you choose B, you're very likely to be found to have been wrong in hindsight. So with a scheme like the one you propose, decision-makers would favor project A over project B, even though A's expected net gain is only about one third of B's.
The last two issues are much the same: your scheme is interesting, but seems too "binary" (you were right or you were wrong, and we average how often you were right or wrong to get your global credibility), and therefore doesn't work well with some of the complexity of decision-making (optimism vs. pessimism, low-risk low-gain vs. high-risk high-gain, or motivational/incentive issues). But if you are in a case in which those issues don't matter much, then it sounds very promising.
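The tension in this example can be spelled out in a few lines of Python. This is only a sketch: the function and variable names are hypothetical, and it assumes (my simplification, not something the scheme specifies) that a vote to fund is judged "right" in hindsight exactly when the project succeeds.

```python
def expected_net_gain(cost: float, p_success: float, payoff: float) -> float:
    """Expected gain of funding a project, net of its cost."""
    return p_success * payoff - cost


# Figures from the example in the text.
projects = {
    "A": {"cost": 100, "p_success": 0.9, "payoff": 150},
    "B": {"cost": 100, "p_success": 0.1, "payoff": 2000},
}

for name, p in projects.items():
    gain = expected_net_gain(p["cost"], p["p_success"], p["payoff"])
    # Assumption: backing the project counts as "right" only if it succeeds.
    p_vindicated = p["p_success"]
    print(f"{name}: expected net gain {gain:.0f}, "
          f"chance of being judged right {p_vindicated:.0%}")

# What the organization wants versus what protects a voter's record.
best_for_org = max(projects, key=lambda n: expected_net_gain(**projects[n]))
best_for_record = max(projects, key=lambda n: projects[n]["p_success"])
```

Under that assumption `best_for_org` is B while `best_for_record` is A: a voter protecting a right/wrong tally is pulled toward the project with the lower expected gain.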
The following is slightly edited from a pitch I wrote for a general audience. I've added blog-specific content afterwards.
Information technology allows for unprecedented levels of collaboration and debate. When an issue arises people communicate freely, are held accountable for misrepresentations, and are celebrated for cogent analysis. We share information and opinion better than ever before. And then we leave the actual decision up to one person, or a tiny committee, or a poll of a population that for the most part wasn't paying attention, or at best an unpredictably irrational market. The one thing we still don't aggregate in a sophisticated way is human judgment.
Organizations evolve complex decision-making structures because variance in human judgement is complicated. We try to put the most competent person in charge—but there is wisdom in crowds, and so a wise leader gets buy-in from a broad pool of competent subordinates. We must constantly try to evaluate who has the best record, to see who's been right in the past...and we get it wrong all the time. We overestimate our own competence. In hindsight, we misremember the right decision as being obvious. We trust the man with the better hair. Any organization with group buy-in on decisions amasses a solid amount of data on the competence of its members, but it does not curate or use this data effectively.
We can do better, using current technology, some simple software, and some relatively simple math. The solution is called histocracy. It is most easily explained with a use case.
The H Foundation is a hypothetical philanthropic organization, with a board of twelve people overseeing a large fund. Each year, they receive and review several hundred grant applications, and choose a few applicants to give money to. Sometimes these applicants use the money effectively, and sometimes they fail. Often an applicant they turn down will get funding elsewhere and experience notable success or failure. In short, it is often obvious to the board in hindsight whether they made the right decision. For each application, the yea or nay of each board member is recorded. If and when, later, the board reaches a consensus on whether that application should have been approved, this consensus is recorded as well. The result is that each board member accumulates a score. Alice's votes have been right 331 times and wrong 59 times, while Bob's votes have been right 213 times and wrong 110 times (they weren't always present for the same votes). Already from this raw data we can see that Alice's opinion should count for more than Bob's. With a computer's ease with arithmetic, we can quantify this. Some math is given in an appendix; here it suffices to say that it would be reasonable to give Alice's vote a little over 7/4 the weight of Bob's: if the board is to maximize its chance of making the correct choice, 4 Alices should be able to outvote 7 Bobs. The board members each connect to a shared server and vote on each application; the software performs the relevant calculations and determines the victor.
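The appendix is not reproduced here, so the following Python sketch is only one plausible way to turn such records into voting weights, not necessarily the formula behind the figure quoted above. It assumes Laplace's rule of succession to estimate each member's accuracy and uses the log-odds of that estimate as the weight, a common choice when voters are treated as independent. With these assumptions the Alice-to-Bob ratio comes out differently from the roughly 7/4 given in the text. All function names are hypothetical.

```python
import math


def laplace_accuracy(right: int, wrong: int) -> float:
    """Estimated chance of being right, by Laplace's rule of succession."""
    return (right + 1) / (right + wrong + 2)


def log_odds_weight(right: int, wrong: int) -> float:
    """Assumed weighting: log-odds of the smoothed accuracy.

    Positive for a better-than-chance record, negative for a worse one.
    """
    p = laplace_accuracy(right, wrong)
    return math.log(p / (1 - p))


def weighted_decision(votes: dict, records: dict) -> bool:
    """Return True (approve) if the weighted yeas outweigh the weighted nays.

    votes:   member name -> True (yea) or False (nay)
    records: member name -> (times right, times wrong)
    A tie counts as a rejection in this sketch.
    """
    score = 0.0
    for name, yea in votes.items():
        weight = log_odds_weight(*records[name])
        score += weight if yea else -weight
    return score > 0


# Records taken from the text.
records = {"Alice": (331, 59), "Bob": (213, 110)}

for name, (right, wrong) in records.items():
    print(f"{name}: weight {log_odds_weight(right, wrong):.3f}")

print(weighted_decision({"Alice": True, "Bob": False}, records))
```

In a two-person disagreement the member with the stronger record prevails, so the last line reports an approval when Alice votes yea and Bob votes nay.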
In this system, the board members perform the massively complex task of evaluating the applicants, a job requiring expert judgment and intuition, while the computer dispassionately and precisely evaluates the board. The result is a system wiser than any individual board member.
When scaling this solution up to a large business with thousands of employees, the math stays the same while the interface changes. Decisions need to be shared and discussed on a corporate intranet, and tagged by type so that employees can find and vote on only those decisions they feel competent to vote on. Employees who try to make decisions on matters beyond their competence will fail to accumulate enough voting weight to skew the decision; this means that decisions in all areas can be opened to the entire field. Managers should be encouraged to reframe decisions they are pondering as corporation-wide referenda. Evaluating a decision in hindsight should in this case be reserved to the owners and shareholders, or to a system or charter they have approved.
Expanding the scale even further, the same approach could be applied to advice, solicited or unsolicited. Consider a site to which clients could pay to submit polls on decisions that concerned them. The polls would be conducted and reported histocratically. The client would later be asked to report whether the advice given by the community turned out to be correct. Prizes and recognition could be given to those solvers who accumulate the highest voting weights, thereby incentivizing participation and excellence. For unsolicited advice, a similar approach could be used with petitions.
In summary, we note that human judgment is essentially a set of predictions, and thus can be judged empirically and aggregated mathematically. Group decision-making is such an omnipresent and consequential task that optimizing it may be the single most important thing we can do. Let's do it rigorously, and let's do it now.
On Sunday, I posted a call for solutions in advance of this post. Which is a weird thing to do, but I have a terror of Irrevocable Actions, and I can't untell you something. (Coincidentally, at the same time as people were chiding me for this, a discussion started about my also mildly eccentric decision to put my play behind a semipermeable paymembrane, which has a similar explanation; it's easier to make something free that was once non-free than the reverse, and in many circles charging for something is actually higher-status.)
I didn't mention prediction markets because I didn't want people to anchor on it—it's just a hop, skip, and a jump from futarchy to histocracy, so that would obviate the point. As expected, people went there immediately anyway, and from there to something very close to my idea. Much of the discussion centered around the difficulty of creating a well-defined charter. While I certainly agree that a quantifiable group utility function is usually difficult, if you go up a level of meta you'll see that well-defined charters are everywhere: a decision is correct if and only if the people in power judge it to have been correct. To be a democracy, we don't need to explicitly vote on values—we just need to let people vote on consequences in accordance with their values. The king's order may be ambiguously worded, but your true duty is clear: please the king.
There are some clear advantages to histocracy over futarchy: most relevantly, I believe histocracy will work well on a small scale, while prediction markets require a large crowd. Given enough time and participation, histocracy will inevitably beat a market. There's less moral hazard, and less vulnerability to manipulation.
Futarchy beats histocracy in that there's a built-in incentive to participate and excel: but people vote in elections and serve on non-profit boards for free, so I don't see a huge need to inject cash. Futarchy allows for individual actors to express degrees of confidence in a way that my model of histocracy doesn't, but this could be remedied where feasible. And Hanson's ideas for how to judge consequences in hindsight might be appropriate for some histocracies.
The potential pitfalls of histocracy depend on the specific implementation. I see politics, in the blue vs. green mind-killing sense, and difficulty of evaluating consequences even in hindsight as the two major Achilles heels; but as far as I can see these are universal. There is a danger of a subgroup amassing a large voting weight, then abusing it in the window before they are removed from power, which can perhaps best be guarded against with some sort of constitutional system, perhaps even one formally incorporated into the system as a high Bayesian prior against certain classes of actions being correct.
I should also concede up front that my “mathematical” appendix glosses over the serious AI challenge of doing this right: hopefully, the computing power available to a histocracy will grow much faster than the number of voters. Log(Laplace(Record)) will double-count terribly in large groups, but it does have the advantage of being simple and transparent; entrusting your government to a black box is scary.
Groups giving histocracy a try should start by making it nonbinding. Only when it's working better than your current system should it be adopted. Unless, of course, your current system is a majority vote, in which case you might as well start using it right away.