johnswentworth

Comments

Kudos for correctly identifying the main cruxy point here, even though I didn't talk about it directly.

The main reason I use the term "propaganda" here is that it's an accurate description of the useful function of such papers, i.e. to convince people of things, as opposed to directly advancing our cutting-edge understanding/tools. The connotation is that propagandists over the years have correctly realized that presenting empirical findings is not a very effective way to convince people of things, and that applies to these papers as well.

And I would say that people are usually correct to not update much on empirical findings! Not Measuring What You Think You Are Measuring is a very strong default, especially among the type of papers we're talking about here.

The big names do tend to have disproportionate weight, but they're few in number, and when a big name promotes something the median researcher doesn't understand everyone just kind of shrugs and ignores them. Memeticity selects who the big names are, much more so than vice-versa.

All four of those I think are basically useless in practice for purposes of progress toward aligning significantly-smarter-than-human AGI, including indirectly (e.g. via outsourcing alignment research to AI). There are perhaps some versions of all four which could be useful, but those versions do not resemble any work I've ever heard of anyone actually doing in any of those categories.

That said, many of those do plausibly produce value as propaganda for the political cause of AI safety, especially insofar as they involve demoing scary behaviors.

EDIT-TO-ADD: Actually, I guess I do think the singular learning theorists are headed in a useful direction, and that does fall under your "science of generalization" category. Though most of the potential value of that thread is still in interp, not so much black-box calculation of RLCTs.

If you're thinking mainly about interp, then I basically agree with what you've been saying. I don't usually think of interp as part of "prosaic alignment"; it's quite different in terms of culture and mindset, and much closer to what I imagine a non-streetlight-y field of alignment would look like. 90% of it is crap (usually in streetlight-y ways), but the memetic selection pressures don't seem too bad.

If we had about 10x more time than it looks like we have, then I'd say the field of interp is plausibly on track to handle the core problems of alignment.

This is the sort of object-level discussion I don't want on this post, but I've left a comment on Fabien's list.

Someone asked what I thought of these, so I'm leaving a comment here. It's kind of a drive-by take, which I wouldn't normally leave without more careful consideration and double-checking of the papers, but the question was asked so I'm giving my current best answer.

First, I'd separate the typical value prop of these sorts of papers into two categories:

  • Propaganda-masquerading-as-paper: the paper is mostly valuable as propaganda for the political agenda of AI safety. Scary demos are a central example. There can legitimately be value here.
  • Object-level: gets us closer to aligning substantially-smarter-than-human AGI, either directly or indirectly (e.g. by making it easier/safer to use weaker AI for the problem).

My take: many of these papers have some value as propaganda. Almost all of them provide basically-zero object-level progress toward aligning substantially-smarter-than-human AGI, either directly or indirectly.

Notable exceptions:

  • Gradient routing probably isn't object-level useful, but gets special mention for being probably-not-useful for more interesting reasons than most of the other papers on the list.
  • Sparse feature circuits is the right type-of-thing to be object-level useful, though not sure how well it actually works.
  • Better SAEs are not a bottleneck at this point, but there's some marginal object-level value there.

My impression, from conversations with many people, is that the claim which gets clamped to True is not "this research direction will/can solve alignment" but instead "my research is high value". So when I've explained to someone why their current direction is utterly insufficient, they usually won't deny that some class of problems exists. They'll instead tell me that the research still seems valuable even though it isn't addressing a bottleneck, or that their research is maybe a useful part of a bigger solution which involves many other parts, or that it's maybe a useful step toward something better.

(Though admittedly I usually try to "meet people where they're at" by presenting failure modes which won't parse as weird to them. If you're just directly explaining, e.g., the dangers of internal RSI, I can see how people might instead just assume away internal RSI or some such.)

... and then if I were really putting in effort, I'd need to explain that e.g. being a useful part of a bigger solution (which they don't know the details of) is itself a rather difficult design constraint which they have not at all done the work to satisfy. But usually I wrap up the discussion well before that point; I generally expect that at most one big takeaway from a discussion can stick, and if they already have one then I don't want to overdo it.

This post isn't intended to convince anyone at all that people are in fact streetlighting. This post is intended to present my own models and best guesses at what to do about it to people who are already convinced that most researchers in the field are streetlighting. They are the audience.

I think you have two main points here, which require two separate responses. I'll do them opposite the order you presented them.

Your second point, paraphrased: 90% of anything is crap, but that doesn't mean there's no progress. I'm totally on board with that. But in alignment today, it's not just that 90% of the work is crap; it's that the most memetically successful work is crap. It's not the raw volume of crap that's the issue so much as the memetic selection pressures.

Your first point, paraphrased: progress toward the hard problem does not necessarily immediately look like tackling the meat of the hard problem directly. I buy that to some extent, but there are plenty of cases where we can look at what people are doing and see pretty clearly that it is not progress toward the hard problem, whether direct or otherwise. And indeed, I would claim that prosaic alignment as a category is a case where people are not making progress on the hard problems, whether direct or otherwise. In particular, one relevant criterion to look at here is generalizability: is the work being done sufficiently general/robust that it will still be relevant once the rest of the problem is solved (and multiple things change in not-yet-predictable ways in order to solve the rest of the problem)? See e.g. this recent comment for an object-level example of what I mean.

From the post:

... but crucially, the details of the rationalizations aren't that relevant to this post. Someone who's flinching away from a hard problem will always be able to find some rationalization. Argue them out of one (which is itself difficult), and they'll promptly find another. If we want people to not streetlight, then we need to somehow solve the flinching.
