A very short version of this post, which seemed worth rattling off quickly for now.


A few months ago, I was talking to John about paradigmaticity in AI alignment. John said, "We don't currently have a good paradigm." I asked, "Is 'Natural Abstraction' a good paradigm?" He said, "No, but I think it's something that's likely to output a paradigm that's closer to the right paradigm for AI Alignment."

"How many paradigms are we away from the right paradigm?"

"Like, I dunno, maybe 3?" said he.

A while later I saw John arguing on LessWrong with (I think?) Ryan Greenblatt about whether Ryan's current pseudo-paradigm was good. (Sorry if I got the names here or substance here wrong, I couldn't find the original thread, and it seemed slightly better to be specific so we could dig into a concrete example.)

One distinction in the discussion seemed to be something like:

  • On one hand, Ryan thought his current paradigm (this might have been "AI Control", as contrasted with "AI Alignment") had a bunch of traction on producing a plan that would at least reasonably help if we had to align superintelligent AIs in the near future.
  • On the other hand, John argued that the paradigm didn't feel like the sort of thing that was likely to bear the fruit of new, better paradigms. It focused on an area of the superintelligence problem that, while locally tractable, John thought was insufficient to actually solve the problem, and also wasn't the sort of thing likely to pave the way to new paradigms.

Now, a) again, I'm not sure I'm remembering this conversation right, and b) whether either of those points is true in this particular case is up for debate, and I'm not arguing that they are. (Also, regardless, I am interested in the idea of AI Control, and I think that getting AI companies to actually take the steps necessary to control at least near-term AIs is worth putting effort into.)

But it seemed good to promote to attention the idea that, when you're looking at clusters of AI Safety research and thinking about whether they are congealing into a useful, promising paradigm, one of the questions to ask is not just "does this paradigm seem locally tractable?" but "do I have a sense that this paradigm will open up new lines of research that can lead to better paradigms?"

(Whether one can answer that question accurately is yet another uncertainty. But I think that if you ask yourself "is this approach/paradigm useful?", your brain will respond with different intuitions than if you ask "does this approach/paradigm seem likely to result in new/better paradigms?")

Some prior reading:


What do you mean by paradigm? It's easy to get confused talking about paradigms.

I think the guiding principle behind whether or not scientific work is good should probably look something more like “is this getting me closer to understanding what’s happening?”, where “understanding” is something like “my measurements track the thing in one-to-one lockstep with reality because I know the right typings and I’ve isolated the underlying causes well enough.”

AI control doesn’t seem like it’s making progress on that goal. That is certainly not to say it’s unimportant; it seems good to me to put some attention on locally useful things. Whereas the natural abstractions agenda does feel like progress on that front.

As an aside: I dislike basically all words about scientific progress at this point. I don’t feel like they’re precise enough, and it seems easy to get satiated on them and lose track of what’s actually important, which is, imo, absolute progress on the problem of understanding what the fuck is going on with minds. Calling this sort of work “science” risks lumping it in with every activity that happens in, e.g., academia, and that isn’t right. Calling it “pre-paradigmatic” risks people writing it off as “Okay, so people just sit around being confused for years? How could that be good?”

I wish we had better ways of talking about it. I think that more precisely articulating what our goals are with agent foundations/paradigmaticity/etc. could be very helpful, not only for the people pursuing it, but also for others to even have a sense of what it might mean for field-founding science to help in solving alignment. As it is, it often seems to get rounded off to “armchair philosophy” or “just being sort of perpetually confused”, which seems bad.

Sorry if I got the names here or substance here wrong, I couldn't find the original thread, and it seemed slightly better to be specific so we could dig into a concrete example

FWIW, I don't seem to remember the exact conversation you mentioned (but it does sound sorta plausible). Also, I personally don't mind you using a fake example with me in it.

[Unimportant, but whatever] Quickly on the object level of the plausibly fictional conversation (lol):

had a bunch of traction on producing a plan that would at least reasonably help if we had to align superintelligent AIs in the near future.

I would more say "seems like it would reasonably help a lot in getting a huge amount of useful work out of AIs". (And then this work could plausibly help with aligning superintelligent AIs, but that isn't clearly the only or even main thing we're initially targeting.)

Yeah I think if I thought more carefully before posting I'd have come up with this rephrasing myself. Matches my understanding of what you're going for.

Thanks for writing this! This is an idea that I think is pretty valuable and one that comes up fairly frequently when discussing different AI safety research agendas.

I think there's a possibly useful analogue of this from the perspective of someone deep inside a cluster of AI safety research who is wondering whether it's good. Specifically, I think we should ask, "Does the value of my current line of research hinge on us basically being right about a bunch of things, or does much of the research value come from discovering all the places we are wrong?"

One reason this feels like an important variant to me is that when I speak to people who are skeptical about the area of research I've been working in, they often seem surprised that I very much agree with them about a number of issues. Still, I disagree with them that the solution is to shift focus; I think it is rather to try to work out how the one paradigm might need to shift into another.