A very short version of this post, which seemed worth rattling off quickly for now.
A few months ago, I was talking to John about paradigmaticity in AI alignment. John said "we don't currently have a good paradigm." I asked "Is 'Natural Abstraction' a good paradigm?" He said "No, but I think it's something that's likely to output a paradigm that's closer to the right paradigm for AI Alignment."
"How many paradigms are we away from the right paradigm?"
"Like, I dunno, maybe 3?" said he.
A while later I saw John arguing on LessWrong with (I think?) Ryan Greenblatt about whether Ryan's current pseudo-paradigm was good. (Sorry if I got the names or substance here wrong; I couldn't find the original thread, and it seemed slightly better to be specific so we could dig into a concrete example.)
One distinction in the discussion seemed to be something like:
- On one hand, Ryan thought his current paradigm (this might have been "AI Control", as contrasted with "AI Alignment") had a bunch of traction on producing a plan that would at least reasonably help if we had to align superintelligent AIs in the near future.
- On the other hand, John argued that the paradigm didn't feel like the sort of thing that was likely to bear the fruit of new, better paradigms. It focused on an area of the superintelligence problem that, while locally tractable, John thought was insufficient to actually solve the problem, and also wasn't the sort of thing likely to pave the way to new paradigms.
Now, a) again, I'm not sure I'm remembering this conversation right, and b) whether either of those points is true in this particular case would be up for debate, and I'm not arguing they're true. (Also, regardless, I am interested in the idea of AI Control and think that getting AI companies to actually do the steps necessary to control at least near-term AIs is something worth putting effort into.)
But it seemed good to promote to attention the idea that: when you're looking at a cluster of AI Safety research and thinking about whether it is congealing into a useful, promising paradigm, one of the questions to ask is not just "does this paradigm seem locally tractable?" but "do I have a sense that this paradigm will open up new lines of research that can lead to better paradigms?"
(Whether one can be accurate in answering that question is yet another uncertainty. But I think if you ask yourself "is this approach/paradigm useful?", your brain will respond with different intuitions than if you ask "does this approach/paradigm seem likely to result in new/better paradigms?")
Some prior reading:
One model that I'm currently holding is that Kuhnian paradigms are about how groups of people collectively decide that scientific work is good, which is distinct from how individual scientists do or should decide that scientific work is good. And collective agreement is way more easily reached via external criteria.
Which is to say, problems are what establishes a paradigm. It's way easier to get a group of people to agree that "thing no go", than it is to get them to agree on the inherent nature of thing-ness and go-ness. And when someone finally makes thing go, everyone looks around and kinda has to concede that, whatever their opinion was of that person's ontology, they sure did make thing go. (And then I think the Wentworth/Greenblatt discussion above is about whether the method used to make thing go will be useful for making other things go, which is indeed required for actually establishing a new paradigm.)
That said, I think that the way that an individual scientist decides what ideas to pursue should usually route through things more like “is this getting me closer to understanding what’s happening”, but that external people are going to track "are problems getting solved", and so it's probably a good idea for most of the individual scientists to occasionally reflect on how likely their ideas are to make progress on (paradigm-setting) problems.
(It is possible for the agreed-upon problem to be "everyone is confused", and possible for a new idea to simultaneously de-confuse everyone, thus inducing a new paradigm. (You could say that this is what happened with the Church-Turing thesis.) But it's just pretty uncommon, because people's ontologies can be wildly different.)
When you say, "I think that more precisely articulating what our goals are with agent foundations/paradigmaticity/etc could be very helpful...", how compatible is that with more precisely articulating problems in agent foundations (whose solutions would be externally verifiable by most agent foundations researchers)?