Trying to get into alignment. Have a low bar for reaching out!
247ca7912b6c1009065bade7c4ffbdb95ff4794b8dadaef41ba21238ef4af94b
I think the alignment stress testing team should probably think about AI welfare more than they currently do, both because (1) it could be morally relevant and (2) it could be alignment-relevant. I'm not sure whether anything concrete would come out of that process, but I get the vibe that this is not thought about enough.
since it's near-impossible to identify which specific heuristic the model is using (there can always be a slightly more complex, more general heuristic of which your chosen heuristic is a special case).
I'm putting some of my faith in low-rank decompositions of bilinear MLPs but I'll let you know if I make any real progress with it :)
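In case it's useful context, here's a rough sketch of the kind of decomposition I mean: the output of a bilinear layer along any readout direction is a quadratic form in the input, and eigendecomposing that form gives a small set of rank-1 input directions you can truncate. Everything below (the dimensions, the readout direction `u`) is made up purely for illustration:

```python
import numpy as np

# Toy bilinear MLP layer: out = (W1 @ x) * (W2 @ x)   (elementwise product)
d_in, d_out = 16, 8
rng = np.random.default_rng(0)
W1 = rng.normal(size=(d_out, d_in))
W2 = rng.normal(size=(d_out, d_in))

# The layer's output along a readout direction u is a quadratic form x^T Q x,
# where Q is a symmetric matrix built from the weights.
u = rng.normal(size=d_out)  # hypothetical downstream direction of interest
Q = 0.5 * sum(u[k] * (np.outer(W1[k], W2[k]) + np.outer(W2[k], W1[k]))
              for k in range(d_out))

# Eigendecomposition writes Q as a sum of rank-1 terms; keeping only the
# largest eigenvalues gives a low-rank approximation of the computation.
eigvals, eigvecs = np.linalg.eigh(Q)
top = np.argsort(-np.abs(eigvals))[:4]  # keep the 4 strongest directions
Q_lowrank = (eigvecs[:, top] * eigvals[top]) @ eigvecs[:, top].T

x = rng.normal(size=d_in)
exact = u @ ((W1 @ x) * (W2 @ x))
print("exact readout:  ", exact)
print("via full Q:     ", x @ Q @ x)
print("via low-rank Q: ", x @ Q_lowrank @ x)
```

Whether the top few eigenvectors turn out to be interpretable in real models is exactly the open question I'm hoping to make progress on.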
This sounds like a plausible story for how (successful) prosaic interpretability can help us in the short to medium term! I would say, though, that more applied mech interp work could supplement prosaic interpretability's theories. For example, the reversal curse seems mostly explained by what little we know about how neural networks do factual recall, and theories of computation in superposition help explain why linear probes can recover arbitrary XORs of features (see the toy sketch below).
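To make the XOR point concrete, here's the kind of toy setup I have in mind (the interference term, feature directions, and dimensions below are all made up for illustration, not taken from any real model): the theory predicts that superposition produces interaction terms like a·b in the representation, and XOR is linear once such a term exists.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d = 2000, 64

# Two binary features, each written along its own random direction.
a = rng.integers(0, 2, size=n)
b = rng.integers(0, 2, size=n)
v_a, v_b, v_ab = rng.normal(size=(3, d))

# Toy "residual stream": the a*b interference term is the kind of interaction
# that computation-in-superposition accounts predict will show up.
h = (np.outer(a, v_a) + np.outer(b, v_b)
     + np.outer(a * b, v_ab)
     + 0.1 * rng.normal(size=(n, d)))

# XOR = a + b - 2ab is linear in (a, b, ab), so once the interaction term is
# present, a purely linear probe can read XOR off the representation.
probe = LogisticRegression(max_iter=1000).fit(h, a ^ b)
print("linear probe accuracy on a XOR b:", probe.score(h, a ^ b))
```

The probe succeeds here precisely because the a·b direction is present; drop that term and no linear probe can beat chance on XOR.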
Reading through your post gave me a chance to reflect on why I am currently interested in mech interp. Here are a few points where I think we differ:
Best of luck with your future research!
I think the actual answer is: the AI isn't smart enough and trips up a lot.
But I haven't seen a detailed write-up anywhere that talks about why the AI trips up and what types of places it trips up in. It feels like all of the existing evals work optimizes for legibility, reproducibility, and being clearly defined. As a result, it's not measuring the one thing that I'm really interested in: why don't we have AI agents replacing workers? I suspect that some startup's internal doc on "why does our agent not work yet" would be super interesting to read and track over time.
I read this post in full back in February. It's very comprehensive. Thanks again to Zvi for compiling all of these.
To this day, it's infuriating that we don't have any explanation whatsoever from Microsoft/OpenAI on what went wrong with Bing Chat. Bing clearly did a bunch of actions its creators did not want. Why? Bing Chat would be a great model organism of misalignment. I'd be especially eager to run interpretability experiments on it.
The whole Bing Chat fiasco also gave me the impetus to look deeper into AI safety (although I think absent Bing, I would've come around to it eventually).
When this paper came out, I don't think the results were very surprising to people who were paying attention to AI progress. However, it's important to do the "obvious" research and demos to share with the wider world, and I think Apollo did a good job with their paper.
TL;DR: This post gives a good summary of how models can get smarter over time yet, while superhuman at some tasks, still suck at others (see the chart comparing the Naive Scenario with actual performance). This is a central dynamic in the development of machine intelligence and deserves more attention. Would love to hear others' thoughts on this; I just realized that the post needed one more positive vote to end up in the official review.
In other words, current machine intelligence and human intelligence are complements, and human + AI organizations will be more productive than human-only or AI-only ones (conditional on the same amount of resources).
The post sparked a ton of follow-up questions for me, for example:
I've wanted to do a deep dive into this for a while now and keep putting it off.
I think many others have made the point about an uneven machine intelligence frontier (at least relative to the frontiers of human intelligence), but this is the first time I've seen it presented so succinctly. I think this post warrants a place in the review, and if it gets one, it'll be a great motivator for me to write up my thoughts on the questions above!
OpenAI released another set of emails here. I haven't looked through them in detail but it seems that they contain some that are not already in this post.
Yes