This makes me sad, but I'm not sure there's a real solution here.
Once the AI is doing most of the work, there are AI debate and formal verification schemes which might help. This of course assumes we’ve solved alignment and many other issues.
I agree. “I worked really hard on it” is neither necessary nor sufficient for research quality. We already know that lots of careful-looking, labor-intensive, neatly written work can still be wrong or non-replicable. Meanwhile, some valuable insights emerge from relatively simple “aha” moments, and some deep ideas are developed more clearly outside the formal journal pipeline (e.g., The Bitter Lesson).
Instead of reverting to the old, imperfect proof-of-work proxy for truth, we should try to figure out how to use these new AI tools to help assess research merit more efficiently.
Granted, some research work will require expensive experiments or other forms of "hard work", in which case proof-of-work can still function as a useful initial filter.
In your post, you say that proof-of-stake is reputation-based and doesn't allow entry to newcomers. But I'm thinking: isn't something like prediction markets (having to pay your replicators and open a prediction market on whether your work will replicate) closer to actual proof-of-stake? You're wagering part of your resources on the work actually holding up.
In the abstract, proof-of-stake seems less wasteful to me, so I think doing away with part of proof-of-work might be beneficial if we know how to do it well. Though of course, your actual point, about LLMs and how to do anything when everything can be automated, does stand and concerns me more and more.
Such proof-of-stake seems like a prohibitive friction cost and would prevent many legitimate authors from publishing. Besides, many illegitimate authors honestly believe that their work will replicate, and many legitimate authors' work fails to replicate anyway. The analogy to proof-of-stake (in the original post) is inexact; it's closer to proof-of-stake in academia.
I'm not exactly sure what you mean, or what the crux is here.
The classical response to "many illegitimate authors honestly believe that their work will replicate" is that this then becomes free money for better-calibrated bettors, which creates a disincentive for those authors over time. The first point I'm less sure about, mainly because I'm not clear on what friction you see there. Is it reputational (e.g., betting on prediction markets seems shady), or not having the money to open up the prediction market and pay replicators? In the latter case, you could imagine something like VC-style funders for promising researchers/authors, who are in turn incentivized to evaluate the fundees properly.
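To make the "free money" argument concrete, here is a minimal sketch. All numbers are hypothetical, and it assumes (for illustration only) that the overconfident author's belief sets the market price of a share that pays out 1 unit if the work fails to replicate:

```python
# Minimal sketch of the incentive argument (all numbers hypothetical).
# Assumption: the author's overconfident belief sets the market price;
# a calibrated bettor buys "does not replicate" shares paying 1 on failure.

def expected_profit(shares: float, author_belief: float, true_prob: float) -> float:
    """Expected profit from buying `shares` of 'does not replicate' at the
    price implied by the author's belief, when `true_prob` is the actual
    replication probability."""
    cost = shares * (1 - author_belief)            # price: 1 - P(replicates)
    expected_payout = shares * (1 - true_prob)     # payout at the true rate
    return expected_payout - cost

# The author believes 90% replication; the true rate is 40%.
print(expected_profit(shares=1000, author_belief=0.9, true_prob=0.4))  # 500.0
```

Repeated across many papers, overconfident authors steadily transfer their stake to calibrated bettors, which is exactly the disincentive in question.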
The analogy to proof-of-stake (in the original post) is inexact; it's closer to proof-of-stake in academia.
Unsure what you mean exactly. If you're saying "the stakes the author is talking about are stakes in academia, so reputation and the like", then I think I agree. But the author seemed to be making a broader point: that academia has so far worked by proof-of-work (so you spend many useless hours writing dissertations on a topic that isn't actually going to be relevant for your field), and that proof-of-stake doesn't offer an appealing option within the analogy. So I was continuing the analogy to look for possibilities that seem to work well in the abstract.
Proof-of-stake and proof-of-work are both often implemented cryptographically, because in cryptographic domains verification can be much easier than generation. I think another option is to apply that principle to the problem directly, where possible. The best example: a math theorem formalized in Lean is much easier to verify than it is to read (see the sketch below). Since making software is now much easier, CS and ML papers can sometimes be distilled into toy implementations, sized appropriately for reviewers (and reviewers' AI instances) to check for hardcoding or cheating. Tough data-analysis domains could be turned into raw data from a reputable source plus a minimal, non-steering prompt for reviewers' models to re-discover what the author wanted to publish. Eventually, I think automated or simulated biology/chemistry labs could be funded to attempt to reproduce new papers' results, putting their reputations on the line.
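For the Lean point, a toy illustration (mine, not from the comment): the proof term can be opaque to a human, but Lean's kernel checks it mechanically, so a reviewer only has to read and trust the statement:

```lean
-- Toy illustration: the proof term may be unreadable, but the kernel
-- verifies it mechanically. A reviewer only needs to check that the
-- *statement* matches what the paper claims.
theorem sum_comm (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

The toy-implementation and re-discovery ideas above are attempts to buy the same verification-generation asymmetry in less formal domains.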
This is not sufficient to protect against motivated bullshitters, especially not in all domains, and I think in the near future institutions may be forced to fall back more on reputation. But I think it's workable. Academic proof-of-work is only, at best, a proxy to avoid being overloaded with verification work, and I think we can make verification easier in other ways.
In computer science, writing up the paper is (almost always) not the hard part. So it’s not acting as proof of work.
I’m told that in other subjects, such as history, turning the raw data into a readable account is the hard part.
The issue with proof-of-stake is that academic reputation is currently a very qualitative system, where X being a better researcher than Y is something that "you just know". There are certainly statistics, like citations and "good quality journals", but those metrics already get gamed pretty mercilessly even when nothing is explicitly thresholded by them. Even if it were possible to identify an un-Goodhartable metric of scientist quality, there is both a longstanding egalitarian culture in the sciences that would bristle at this and a more modern sense of political correctness that would absolutely ignite over it.
That said, I've had a semi-serious idea on the topic of Proof of Work in the age of LLMs for a while. Given the well-established relationship between physical fitness and intellectual capability, causal in both directions[1], physical proof-of-work[2], administered at a gym slash testing center, could be an effective, non-automatable substitute for form-filling with substantial positive externalities. Moreover, it would touch on the ancient Greek roots of much of modern science and philosophy - Plato was a wrestler, after all!
Written quickly as part of the Inkhaven Residency.
Related: Bureaucracy as active ingredient, pain as active ingredient
A widely known secret in academia is that many of the formalities serve, in large part, as proof of work. That is, expensive procedures exist because some way of filtering must, and the amount of effort invested can often be a good proxy for the quality of the work. The pool of research is vast, and good research can be hard to identify; even engaging with research enough to assess its quality can be expensive. As a result, people look for signs of visible, expensive effort to decide whether to engage with the research at all.
Why do people insist on only reading research that’s published in well-formatted, well-written papers, as opposed to looking at random blog posts? Part of the answer is that good writing and formatting make the research easier to digest, and another part is that investing the time to properly write up your results often improves the results themselves. But part of the answer is proof-of-work: surely, if your research is good, you’d be willing to put in the 30-40 hours to run the required experiments and format it nicely as a paper?
Similarly, why do fields often insist on experiments beyond their scientific value? For example, why does machine learning often insist that people run expensive empirical experiments even for theory papers? Of course, part of the answer is that it’s easy to generate theoretical results that have no connection to reality. But another part of the answer is that the empirical experiments serve as the required proof of work; implementing anything on even a medium-sized open-source LLM is hard, but surely you’d invest the effort if you believed enough in your idea? (This helps explain the apparently baffling observation that many of the empirical results in theoretical papers have little relevance to the correctness, or even the applicability, of the theoretical results.)
Other aspects of ML academia – the beautifully polished figures[1], the insistence on citing the relevant papers to show knowledge of the field, and so forth – also exist in part to serve as a proof-of-work filter for quality.
In a sense, this is one of the reasons academia is great. In the absence of a proof-of-work system, the default would be something closer to proof-of-stake: that is, some form of reputational system based on known, previously verified accomplishments. While proof-of-work filters can be wasteful, they nonetheless allow new, unknown researchers to enter the field and contribute (assuming they invest the requisite effort).
An obvious problem with this entire setup is that LLMs exist, and what was once expensive is now cheap. Where good writing was previously expensive, LLMs now allow anyone to produce seemingly coherent, well-argued English text. Where producing ML code was once quite expensive, current LLMs quickly generate seemingly correct code for experiments. And the same is true for most of the proof-of-work signifiers that academia used to depend on: any frontier LLM can produce beautifully formatted figures in matplotlib, cite relevant work (or at least convincingly hallucinate citations), and produce long mathematical arguments.
I’ve observed this myself in actual ML conference contexts. In the past, crackpot papers were relatively easy to identify. But in the last year, I’ve seen at least one crackpot paper get past the other peer reviewers through a combination of dense mathematical jargon and an expansive codebase that was hardcoded to produce the desired results. Specifically, while the reviewers knew that they didn't fully understand the mathematical results, they assumed that this was due to their own lack of knowledge, rather than the results themselves being wrong. And since the codebase passed the cursory review given to it by the other reviewers, they did not investigate it deeply enough to notice the hardcoding.[2]
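To illustrate the failure mode (a hypothetical reconstruction, not the actual paper's code), hardcoding can be as simple as an "experiment" whose outputs never depend on its inputs:

```python
# Hypothetical reconstruction of the failure mode (not the actual paper's
# code): the "experiment" appears to compute a result, but the returned
# numbers are fixed in advance, so the desired claim is always "confirmed".

import random

def run_experiment(model, dataset, seed: int = 0) -> dict:
    random.seed(seed)
    _ = model, dataset                 # inputs are silently ignored
    noise = random.gauss(0, 0.001)     # cosmetic jitter so runs look "real"
    return {
        "accuracy": 0.973 + noise,     # hardcoded headline number
        "baseline": 0.861 + noise,     # hardcoded baseline to beat
    }

print(run_experiment(model=None, dataset=None))
```

A cursory code review of the kind reviewers can realistically afford has little chance against this; it only surfaces when someone re-runs the code on varied inputs or reads it carefully.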
In a sense, this is no different from the problems introduced by AI in other contexts, and I’m not sure there’s a better solution than falling back to previous proof-of-stake–like reputation systems.[3] At the very least, I find it hard to engage with new, seemingly exciting results from unknown researchers without a high degree of skepticism.
This makes me sad, but I'm not sure there's a real solution here.
Especially the proliferation of beautiful "figure one"s that encapsulate the paper's core ideas and results in a single figure.
In fact, it took me about an hour to decide that the paper's results were simply wrong as opposed to confusing. Thankfully, in this case, the paper's problems were obvious enough that I could point the other reviewers to, e.g., specific hardcoded results (and the paper was not accepted for publication), but there's no guarantee that this would always be the case.
Of course, there are other possibilities that less pessimistic people would no doubt point to: for example, there could be a shift toward proof-of-work setups that are LLM-resistant, or we could rely on LLMs to do the filtering instead. But insofar as LLMs are good at replicating all cognitively shallow human effort, I don't imagine there are going to be any proof-of-work setups that continue to work as LLMs get better. And I personally feel pretty sad about delegating all of my input to Claude.