Two notes:
See this: https://www.lesswrong.com/posts/CTBta9i8sav7tjC2r/how-to-hopefully-ethically-make-money-off-of-agi
Can you CC me too?
I work from the same office as John, and dozens of LessWrong readers also work there regularly. We could probably set up an experiment here with many willing volunteers, and I'm interested in helping make it happen (if it still seems promising after thinking about it more).
[Mod note: I edited out your email from the comment, to save you from getting spam email and similar. If you really want it there, feel free to add it back! :) ]
Mod here: most of the team were away over the weekend, so we just hadn't gotten around to processing this for personal vs. frontpage yet. (All posts start as personal until approved for frontpage.) We're about to make a decision in this morning's moderation review session, as we do for all other new posts.
Jake himself has participated in both Zika and Shigella challenge trials.
Your civilisation thanks you 🫡
Cool idea, and congrats on shipping! I've installed it and am trying it now. One piece of user feedback: having to wait for complete replies felt a bit high-friction. Maybe you could stream responses in chunks? (I did this for a GPT-to-Slack app once. You just can't do it letter-by-letter, because you'll hit rate limits.)
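In case it's useful, here's roughly the batching logic I mean, as a minimal sketch (the function name and thresholds are made up; the actual sending would be a Slack `chat.postMessage` followed by `chat.update` calls, which I've left out):

```python
import time

def batch_stream(tokens, min_chars=80, min_interval=1.0, now=time.monotonic):
    """Accumulate streamed tokens and yield the growing message only after at
    least `min_chars` new characters AND `min_interval` seconds have passed,
    so each yield maps to one Slack chat.update call instead of one per token."""
    buf, pending, last = [], 0, now()
    for tok in tokens:
        buf.append(tok)
        pending += len(tok)
        if pending >= min_chars and now() - last >= min_interval:
            yield "".join(buf)
            pending, last = 0, now()
    if pending:  # flush whatever is left as the final message edit
        yield "".join(buf)
```

Each yielded string is the full message-so-far, so you'd pass it as the replacement text in a `chat.update` call; tuning the thresholds keeps you comfortably under Slack's rate limits.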
If that's your belief, I think you should edit a disclaimer into your TL;DR section, like "Gemini and GPT-4 authors report results close to or matching human performance at 95%, though I don't trust their methodology".
Also, the numbers aren't "non-provable": anyone could just replicate them with the GPT-4 API! (Modulo dataset contamination considerations.)
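To gesture at what a replication would look like, here's a toy sketch of the scoring loop (the items and the `always_first` "model" are made-up stand-ins; a real run would pull actual HellaSwag rows and have `query_model` call the GPT-4 API):

```python
def accuracy(items, query_model):
    """Score multiple-choice items: query_model(ctx, endings) returns the
    index of the ending the model picks; compare against gold labels."""
    correct = sum(
        query_model(item["ctx"], item["endings"]) == item["label"]
        for item in items
    )
    return correct / len(items)

# Toy items standing in for real HellaSwag rows (context, endings, gold index).
items = [
    {"ctx": "She picks up the violin and", "endings": ["plays it", "eats it"], "label": 0},
    {"ctx": "He opens the umbrella because", "endings": ["the sun sets", "it is raining"], "label": 1},
]

# Stub "model" that always picks the first ending; swap in a real API call here.
always_first = lambda ctx, endings: 0
print(accuracy(items, always_first))  # 0.5 on this toy pair
```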
Humans achieve over 95% accuracy, while no model surpasses 50% accuracy. (2019)
A series on benchmarks does seem very interesting and useful -- but you really gotta report more recent model results than ones from 2019!! GPT-4 reportedly scores 95.3% on HellaSwag, which makes that initial claim in the post very misleading.
Noting that a nicer name that's just waiting to be had, in this context, is "Future of the Lightcone Institute" :)