After reading Yudkowsky’s list of lethalities, I searched for some kind of prize out there that might be similar to what he suggests in his point #40, to motivate truly impactful AI safety work:
“promising to pay big money retrospectively for good work to anyone who produces it.”
I found a few prizes that had been awarded already or to which applications were closed. I also found people talking about potential prizes that, as far as I know, never came to be.
Here I outline what I think a decent AI Safety Prize could look like. I’m happy to hear suggestions for improvements on this proposal, or potential issues you see with it, so please do write them in the comments if you have them. I may send this off to XPRIZE as a prize suggestion.
Outline of an AI Safety Prize
This AI Safety Prize would be focused on establishing the ability to produce a safe, "aligned" AGI system (as indicated, but not proven, by a system that showed apparent alignment for at least 3 months, see below). It would not, of course, be a guarantee that no misaligned AGI ever came online.
The total prize should be at least $10M, to attract enough attention, with at least $5M of that winnable by a single individual or team. Who would pay for it? Companies that ended up using the developed AI safety technique(s). Participating companies would agree to pay out a certain prize amount if they used a safety advancement in their AGI/near-AGI product for at least 3 months with no apparent misalignment/safety issues showing up. If companies didn’t end up using a particular technique, they wouldn’t pay out for it. For closed models, this would require honesty on their part, but I suspect the small relative prize amounts (compared to the huge economic upsides of AGI) and the possible downsides of bad PR would motivate companies to stay honest. Companies could also decide to solicit funding for their portion of the prize from donors.
Safety concepts would have to be actionable by someone reasonably skilled in the art. For example, saying “we should use reinforcement learning with feedback from elephants” wouldn’t qualify for a prize, while providing a specific mechanism/algorithm that could do this (detailed plans for brain chips plus peanut reward functions?) would. Advancements that make it into a product to enhance its performance, not its safety, wouldn’t count. Dual-use concepts (those that enhance both performance and safety) may not be awarded a prize if companies, in consultation with judges, decide it would be too great a safety risk to admit to using such a concept in an AGI/near-AGI.
Any ideas/results already publicly available before the opening of the prize application period wouldn’t be eligible for a prize. [Added 2-1-24: Any ideas already developed in-house at the prize sponsor companies, but not publicly known, would also not be eligible.]
The prize application period would be open for 4 years. Lesser prizes would be awarded for ideas that go into near-AGI products than for ideas that go into AGI. If any of these ideas scaled to AGI within 4 years after the prize application deadline, they could win an additional AGI-level prize as well. Prizes could be split if multiple ideas were used together. There would be an independent panel of 3 judges whose decisions on how much to award to whom would be final. Judges and their immediate families would be ineligible for awards.
Even though some may spin this as companies outsourcing safety, I think it generally would be a PR win for companies to participate in funding this prize. Potential funding companies include OpenAI, Anthropic, Meta, Alphabet, Amazon, IBM, Apple, Microsoft, and x.AI. I’m sure there are others. If 5 companies signed up, they could be liable, for instance, for $2M each, although unequal contributions to the prizes would be acceptable.