Trying to get into alignment. Have a low bar for reaching out!
247ca7912b6c1009065bade7c4ffbdb95ff4794b8dadaef41ba21238ef4af94b
When this paper came out, I don't think the results were very surprising to people who were paying attention to AI progress. However, it's important to do the "obvious" research and demos and share them with the wider world, and I think Apollo did a good job with their paper.
TL;DR: This post gives a good summary of how models get smarter over time, yet while they become superhuman at some tasks, they can still suck at others (see the chart comparing the Naive Scenario to Actual performance). This is a central dynamic in the development of machine intelligence and deserves more attention. Would love to hear others' thoughts on this. I just realized that it needed one more positive vote to end up in the official review.
In other words, current machine intelligence and human intelligence are complements, and human + AI organizations will be more productive than human-only or AI-only organizations (conditional on the same amount of resources).
The post sparked a ton of follow-up questions for me, for example:
I've wanted to do a deep dive into this for a while now and keep putting it off.
I think many others have made the point about an uneven machine intelligence frontier (at least when compared to the frontier of human intelligence), but this is the first time I've seen it presented so succinctly. I think this post warrants a place in the review, and if it makes it in, that'll be a great motivator for me to write up my thoughts on the questions above!
OpenAI released another set of emails here. I haven't looked through them in detail, but it seems they contain some that are not already in this post.
Any event next week?
Yeah, my view is that as long as our features/intermediate variables form human-understandable circuits, it doesn't matter how "atomic" they are.
Almost certainly not an original idea: given the increasing fine-tuning access to models (see also the recent reinforcement fine-tuning announcement from OpenAI), see whether fine-tuning on goal-directed agent tasks for a while leads to the types of scheming seen in the paper. You could maybe just fine-tune on the model's own actions when it successfully solves SWE-Bench problems or something (rough sketch below).
(I think some of the Redwood folks might have already done something similar but haven't published it yet?)
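A minimal sketch of what I mean, assuming OpenAI-style chat fine-tuning: the file names, trajectory format, and base model name are all placeholders I made up, and the interesting part (running scheming evals on the resulting checkpoint) is left as a comment.

```python
# Sketch: fine-tune a model on its own successful SWE-Bench trajectories,
# then check whether the resulting checkpoint schemes more than the base model.
# Paths, trajectory schema, and model name are placeholders.
import json
from openai import OpenAI

client = OpenAI()

# 1. Load agent transcripts (assumed: dicts with an OpenAI-style "messages"
#    list and a "resolved" flag marking whether the task was solved).
with open("swebench_trajectories.json") as f:
    trajectories = json.load(f)

successful = [t for t in trajectories if t.get("resolved")]

# 2. Write the successful runs in the chat fine-tuning JSONL format.
with open("train.jsonl", "w") as f:
    for t in successful:
        f.write(json.dumps({"messages": t["messages"]}) + "\n")

# 3. Upload the file and start a fine-tuning job.
train_file = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(
    training_file=train_file.id,
    model="gpt-4o-mini-2024-07-18",  # placeholder; any fine-tunable model
)
print(job.id)

# 4. When the job finishes, run the fine-tuned checkpoint through the
#    Apollo-style scheming evals and compare against the base model.
```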
What is the probability that the human race will NOT make it to 2100 without any catastrophe that wipes out more than 90% of humanity?
Could we have this question be phrased using no negations instead of two? Something like "What is the probability that there will be a global catastrophe that wipes out 90% or more of humanity before 2100."
Thanks for writing these posts, Zvi <3 I've found them to be quite helpful.
Hi Clovis! Something that comes to mind: Zvi's dating roundup posts, in case you haven't seen them yet.
I read this post in full back in February. It's very comprehensive. Thanks again to Zvi for compiling all of these.
To this day, it's infuriating that we don't have any explanation whatsoever from Microsoft/OpenAI on what went wrong with Bing Chat. Bing clearly took a bunch of actions its creators did not want. Why? Bing Chat would be a great model organism of misalignment. I'd be especially eager to run interpretability experiments on it.
The whole Bing Chat fiasco also gave me the impetus to look deeper into AI safety (although I think absent Bing, I would've come around to it eventually).