I recently found myself in a spirited debate with a friend about whether large language models (LLMs) like GPT-4 are mere stochastic parrots or whether they can genuinely engage in deeper reasoning.

We both presented a range of technical arguments and genuinely considered each other’s points. Despite our efforts, we ended up firmly holding onto our initial positions. This led me to ponder: How can I determine if I am right when both of us are convinced of our correctness, yet at least one of us must be wrong?

To address this, I developed a scoring system using measurable metrics to determine who is more likely to be correct. I call it the AmIRight Score.

AmIRight Score

The AmIRight Score assigns points across several categories, helping to gauge the likelihood of being correct. Here’s how you can calculate your score:

1. Clarity in Falsification Criteria – 10 points

A person who can clearly articulate how their belief could be proven wrong demonstrates the ability to conceptualize alternative truths. If someone cannot envision any scenario that would falsify their belief, it suggests that their belief might be dogmatic.

Example of a good falsification statement: “I would believe AI is capable of deeper reasoning if it can be trained on data containing no information about chess and then perform as well as a human who is also new to the game, given the same set of instructions.”

Example of a bad falsification statement: “I would believe AI is capable of deeper reasoning if all the scientists in the world acknowledged they were wrong about reasoning based on new evidence about the brain.”

2. The Simplified Ideological Turing Test – 10 points

The Ideological Turing Test evaluates how well you can articulate the opposing viewpoint. In the simplified version, both parties write arguments for their own position and the opposite position. A neutral judge then scores how well each argument is presented without knowing who wrote what.
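
To make the blind-judging step concrete, here is a minimal sketch in Python of one way the simplified test could be run and scored; the authors, argument texts, judge ratings, and the rule of awarding points from the opposing-side argument are all hypothetical choices, not part of the original procedure:

```python
import random

# Each debater writes one argument for their own side and one for the other
# side; the judge rates the anonymized texts from 0 to 10 without knowing
# who wrote what. Authors, texts, and ratings here are hypothetical.
arguments = [
    {"author": "A", "own_side": True,  "text": "LLMs cannot reason because ..."},
    {"author": "A", "own_side": False, "text": "LLMs can reason because ..."},
    {"author": "B", "own_side": True,  "text": "LLMs can reason because ..."},
    {"author": "B", "own_side": False, "text": "LLMs cannot reason because ..."},
]

random.shuffle(arguments)          # the judge sees the texts in random order
judge_ratings = [7, 8, 6, 5]       # judge's 0-10 rating for each shuffled text

for arg, rating in zip(arguments, judge_ratings):
    arg["rating"] = rating

# One simple way to award the 10 points: use the rating the judge gave to
# each debater's argument for the side they do NOT actually hold.
for author in ("A", "B"):
    opposing = next(a for a in arguments if a["author"] == author and not a["own_side"])
    print(f"{author} earns {opposing['rating']} of 10 points")
```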

3. Forecasting Accuracy – 5 points

Forecasting accuracy assesses the correctness of your predictions about future events. This metric rewards those whose predictions consistently turn out to be accurate. Both parties should take the same forecasting test, and points are awarded based on performance.
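
As a rough sketch of how performance could be turned into points, assuming a shared test of yes/no forecasting questions (the predictions and outcomes below are hypothetical):

```python
# Hypothetical shared forecasting test: (predicted outcome, actual outcome)
# for a set of yes/no questions both debaters answered.
predictions = [
    (True, True),
    (False, False),
    (True, False),
    (True, True),
]

accuracy = sum(pred == actual for pred, actual in predictions) / len(predictions)
points = 5 * accuracy  # scale the hit rate onto the 5 points for this category
print(f"accuracy = {accuracy:.2f}, points = {points:.2f}")
```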

4. Forecasting Calibration – 5 points

Forecasting calibration measures how well your confidence levels match actual outcomes. It’s not just about being right but also about accurately assessing the likelihood of being right. The same forecasting test used for accuracy can measure calibration, with points awarded based on the Brier score of the predictions.
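
For reference, the Brier score for binary forecasts is the mean squared difference between the stated probability and the actual outcome (0 or 1), so lower is better. Below is a minimal sketch with hypothetical forecasts and an arbitrary mapping from Brier score onto the 5 available points:

```python
# Hypothetical forecasts: (stated probability that the event happens, outcome).
forecasts = [
    (0.9, True),
    (0.2, False),
    (0.7, False),
    (0.6, True),
]

# Brier score: mean squared difference between probability and outcome (lower is better).
brier = sum((p - float(happened)) ** 2 for p, happened in forecasts) / len(forecasts)

# One arbitrary mapping onto the 5 points: full points at a Brier score of 0.0,
# no points at 0.25 or worse (0.25 is what always saying "50%" would score).
points = max(0.0, 5 * (1 - brier / 0.25))
print(f"Brier score = {brier:.3f}, points = {points:.2f}")
```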

5. Deeper Understanding of the Subject – 5 points

This metric evaluates your comprehension of the subject’s complexities and nuances beyond surface-level knowledge.
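
Putting the categories together, here is a minimal sketch of the overall tally; the per-category scores for the two debaters are hypothetical placeholders for whatever each test produced:

```python
# Maximum points per category, as defined above.
MAX_POINTS = {
    "falsification_criteria": 10,
    "ideological_turing_test": 10,
    "forecasting_accuracy": 5,
    "forecasting_calibration": 5,
    "deeper_understanding": 5,
}

def amiright_score(earned):
    """Sum the earned points, capping each category at its maximum."""
    return sum(min(earned.get(cat, 0), cap) for cat, cap in MAX_POINTS.items())

# Hypothetical results for two debaters.
me     = {"falsification_criteria": 8, "ideological_turing_test": 7,
          "forecasting_accuracy": 4, "forecasting_calibration": 3,
          "deeper_understanding": 4}
friend = {"falsification_criteria": 5, "ideological_turing_test": 6,
          "forecasting_accuracy": 3, "forecasting_calibration": 4,
          "deeper_understanding": 5}

total = sum(MAX_POINTS.values())
print(f"me:     {amiright_score(me)} / {total}")
print(f"friend: {amiright_score(friend)} / {total}")
```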

 

Final Thoughts

While the AmIRight Score can be a useful tool for assessing probabilities in one-on-one debates, its applicability may be limited in areas where there are many brilliant minds on either side of the argument. Nonetheless, it provides a structured approach to critically evaluating our beliefs and arguments.


 


These things are indeed correlated with being right, but aren't you risking Goodharting? What does it really mean to "be right" about things? If you're native to LessWrong you'll probably answer something like, "to accurately anticipate future sensory experiences". Isn't that all you need? Find an opportunity for you and your friend to predict measurably different futures, then see who wins. All the rest is distraction.

And if you predict all the same things, then you have no real disagreement, just semantic differences.

In some cases I agree; for example, it doesn't matter whether GPT-4 is a stochastic parrot or capable of deeper reasoning, as long as it is useful for whatever need we have.

Two of the five metrics are about predicting the future, so prediction is an important part of knowing who is right, but I don't think it is all we need. If we have other factors that also correlate with being correct, why not add those in?

Also, I don't see where we risk Goodharting. Which of the metrics do you see being gamed without the chance of being correct also increasing significantly?

Why pay mind to what's correlated with being right, when you have the option of just seeing who's right?

I'm arguing that being right is the same as "holding greater predictive power", so any conversation that's not geared toward "what's the difference in our predictions?" is not about being right, but rather about something else, like "Do I fit the profile of someone who would be right" / "Am I generally intelligent" / "Am I arguing in good faith" etc.

I like this idea but the point values seem arbitrary.

True, it would be interesting to conduct an actual study and see which metrics are the most useful predictors.