On Jan 8, 2024, I wrote a Google Doc with my AI predictions for the next 6 years (and edited it slightly on Feb 24). I’ve now quickly sorted each prediction into Correct, Incorrect, or Unclear. The post below includes all of my 2024 predictions, with the original text unedited and my commentary in indented bullets.

Correct

  • There is a viral app (probably Suno) for generating music, reaching 1 million users by July 2024.
  • An open source GPT-4-level model is released.
  • Adept AI and similar publicly available browser-based assistants are still not useful enough to operate a browser window for more than 30 seconds without human supervision. They still have problems like clicking on the wrong part of the screen, getting lost, getting distracted, etc.
    • I haven’t seen any agents that are actually able to navigate a browser competently yet.
  • Sora is released to customers who apply for access.
  • If OpenAI makes the video continuation feature available, many new memes are created where people use Sora to extend existing videos in funny ways or stitch two videos together.
    • Example (although these don’t use Sora). I find it amusing how specific this prediction was. Possibly I’d already seen an example at that point?
  • We will see the first signs of evals for moral patienthood in LLMs. Some of the AGI labs will make a public statement where they mention this possibility.
    • The Anthropic Fellows Program is looking for people to work on “AI Welfare: Improving our understanding of potential AI welfare and developing related evaluations and mitigations.”
  • 6/12 METR tasks are complete
    • This suite is deprecated but my best guess is that it would resolve Correct.
  • 1/5 ARA tasks are complete
    • This suite (the five tasks described in Anthropic's original RSP) is deprecated but it’s likely true that Claude 3.5 Sonnet Upgraded would complete at least one task.

Incorrect

  • AI music leads to protests/complaints in the music industry. Major artists (>3% of playtime-weighted artists) say something like “AI music is bad!”.
  • Microsoft Copilot and similar tools increase office worker productivity (speed on <1hr tasks) by 25%. Most of the accelerated labor is pretty menial (making presentations, writing emails, making/manipulating spreadsheets).
  • OpenAI keeps some sort of logs for more than half the videos it generates, so that it can power an AI-generated video detection tool that checks videos against its database to determine whether they’re Sora-generated.
  • Many artists (especially those working on filmmaking/3D animation) are pissed off by [Sora], and protests against selling AI-generated video happen. They’re of a similar size (within 3x) to the Hollywood screenwriters’ protests.
  • Twitter debuts a system (or modifies an existing system) to mark AI-generated video as AI-generated.
  • Sora, when prompted to play Minecraft with some GPT-4 scaffolding and a keyboard/mouse screen overlay, can play semi-competently. It mostly fails at fine motor control tasks, including aiming at trees, using the inventory, and similar. However, it plays much slower than real time, as the API isn’t set up for this kind of frame-by-frame generation.
    • No one has tried this, as far as I know, but it would probably fail. When Sora generates Minecraft footage from scratch, it hallucinates a lot, so I’m guessing it wouldn’t be very good at playing it.

Unclear

  • GPT-5 or GPT-4.5 is released, which is noticeably more capable than GPT-4
    • GPT-4o and o1 came out, breaking the GPT-N naming pattern, but their capabilities are roughly what I’d expect from a GPT-4.5 model.
  • There are US headlines of (accusations of) AI-assisted election interference in a country with a population of at least 10M, probably the US. The interference is mostly done by flooding social media websites with semi-convincing fake personas (that a media-literate person can spot after 2 minutes of looking into them). Most of the bots make public posts and some DM people with personalized approaches (catering to people’s interests and opinions). It’s done using an open source or hidden state-owned model.
  • DARPA announces the winners of the AI cyber challenge. They are very underwhelming to the alignment community (if we think about the results at all), not taking into account superhuman hacking abilities, but there are some good nuggets (progress toward quick automatic threat detection).

Conclusion

The main pattern I notice looking back at my 2024 predictions is that benchmarks and capabilities improve quickly, but real-world impacts (especially societal backlash and protests) arrive more slowly than I expected.

Comments

More partial credit on the second-to-last point:

https://home.treasury.gov/news/press-releases/jy2766

Aside: I don’t think it’s just that real-world impacts take time to unfold. Lately I’ve felt that evals are only very weakly predictive of impact (because making great ones is extremely difficult). It could be that models available now don’t have substantially more mundane utility (economic potential stemming from first-order effects) than models available a year ago, outside of the domains the labs are explicitly targeting (like math and code).