The most common, these days, is some variant of “train an AI to help with aligning AI”. Sometimes it’s “train an AI to interpret the internals of another AI”, sometimes it’s “train an AI to point out problems in another AI’s plan”, sometimes it’s “train an AI to help you design aligned AI”, etc. I would guess about 75% of newcomers from ML suggest some such variant as their first idea.
I don't think these are crazy or bad ideas at all—I'd be happy to steelman them with you at some point if you want. Certainly, we don't know how to make any of them work right now, but I think they are all reasonable directions to go down if one wants to work on the various open problems related to them. The problem—and this is what I would say to somebody if they came to me with these ideas—is that they're not so much “ideas for how to solve alignment” as “entire research topics unto themselves.”
I sometimes worry that ideas are prematurely rejected because they are not guaranteed to work, rather than because they are guaranteed not to work. In the end it might turn out that zero ideas are actually guaranteed to work, and we are left with an assortment of not-guaranteed-to-work ideas, each underdeveloped because some possible failure mode was found and the idea was abandoned early.
That is a problem in principle, but I'd guess that the perception of that problem mostly comes from a couple other phenomena.
First: I think a lot of people don't realize on a gut level that a solution which isn't robust is guaranteed to fail in practice. There are always unknown unknowns in a new domain; the presence of unknown unknowns may be the single highest-confidence claim we can make about AGI at this point. A strategy which fails the moment any surprise comes along is going to fail; robustness is necessary. Now, robustness is not the same as "guaranteed to work", but the two are easy to confuse. A lot of arguments of the form "ah but your strategy fails in case X" look like they're saying "the strategy is not guaranteed to work", but the actually-important content is "the strategy is not robust to <broad class of failures>"; the key is to think about how broadly the example-failure generalizes. (I think a common mistake newcomers make is to argue "but that particular failure isn't very likely", without thinking about how the failure mode generalizes or what other lack-of-robustness it implies.)
Second: I think it's very common for people to say "we just don't know whether X will work" when in fact we have enormous amounts of real-world evidence about close analogues of X. (This thread on the Sandwich Problem post is a central example.) People imagine that we need to run an Official Experiment in order to Know Things, and that's just not how the world actually works. In general, we have an enormous amount of relevant prior information from the world. But often all that prior information is not as legible as an Official Experiment, so it's harder to explain the argument. I think people confuse the lack of legibility with a lack of certainty.
The issue is that it's very difficult to reason correctly in the absence of an "Official Experiment"[1]. I think the alignment community is too quick to dismiss potentially useful ideas, and that the reasons for those dismissals are often wrong. E.g., I still don't think anyone's given a clear, mechanistic reason for why rewarding an RL agent for making you smile is bound to fail (as opposed to being a terrible idea that probably fails).
[1] More precisely, it's very difficult to reason correctly even with many "Official Experiments", and nearly impossible to do so without any such experiments.
It's a preparadigmatic field. Nobody is going to prove beyond a shadow of a doubt that X fails, for exactly the same reasons that nobody is going to prove beyond a shadow of a doubt that X works. And that just doesn't matter very much, for decision-making purposes. If something looks unlikely to work, then the EV-maximizing move is to dismiss it and move on. Maybe one or two people work on the thing-which-is-unlikely-to-work in order to decorrelate their bets with everyone else, but mostly people should ignore things which are unlikely to work, especially if there are already one or two people looking more closely at it.
If I understand your idea, you propose that new people will try to think of new ideas, and when they say "How about A?", someone more "mature" says, "No, that won't work because of X", then they say "How about B?", and get the response "No, that won't work because of Y", and so forth, until finally they say "How about Q?", and Q is something no-one has thought of before, and so is worth investigating.
It could be that a new Q is what's needed. But might it instead be that "won't work because of Y" is flawed, and what is needed is someone who can see that flaw? It doesn't seem like this proposal would encourage discovery of such a flaw, once the new person is accustomed to listening to the "mature" person's dismissal of "non-working" ideas.
This seems like it might be a situation where personal interaction is counterproductive. Of course the new person should learn something about past work. But it's easier to question that past work, and persist in trying to think of how to make B work, when the dismissals of B as not workable are in papers one is reading, rather than in personal conversation with a mentor.
If I understand your idea, you propose that new people will try to think of new ideas, and when they say "How about A?", someone more "mature" says, "No, that won't work because of X", then they say "How about B?", and get the response "No, that won't work because of Y", and so forth, until finally they say "How about Q?", and Q is something no-one has thought of before, and so is worth investigating.
Nope, that is not what I propose. I actually give my mentees pretty minimal object-level feedback, mainly because I don't want them in the habit of deferring to my judgement. When I did give them intensive feedback for a few days, it was explicitly for the purpose of "building a John model", and I chose that framing specifically to try to keep the "John model" separate from peoples' own models.
I generally think it's best to do these exercises with a peer group, not with someone whose judgement one will hesitate to question. (Although of course reading stuff by more experienced people - like e.g. List of Lethalities - or occasionally getting feedback from more experienced people is a useful sub-step along the way.) The target outcome is not that people will ask more experienced people to find holes in their plans, but rather that people will look for the holes in their own plans, and iterate independently.
It is not clear to me to what extent this was part of the "training shoulder advisors" exercise, but to me, possibly the most important part of it is to keep the advisors at a distance from your own thinking. In particular, my impression is that alignment research has, on average, been harmed by too many people "training their shoulder Eliezers" and then having those shoulder advisors push them to think in a crude version of Eliezer's ontology.
I chose the "train a shoulder advisor" framing specifically to keep my/Eliezer's models separate from the participants' own models. And I do think this worked pretty well - I've had multiple conversations with a participant where they say something, I disagree with it, and then they say "yup, that's what my John model said" - implying that they did in fact disagree with their John model. (That's not quite direct evidence of maintaining a separate ontology, but it's adjacent.)
Overall, my very tentative and subjective impression is that the program shaved ~3 years off the median participant’s Path of Alignment Maturity; they seem-to-me to be coming up with project ideas about on par with a typical person 3 years further in. The shoulder John/Eliezer exercises were relatively costly and I don’t think most groups should try to duplicate them, but other than those I expect most of the MATS content can scale quite well, so in principle it should be possible to do this with a lot more people.
This seems like super amazing news! If this is true, your potential work improving and scaling this stuff seems clearly much higher EV than your research (in the next year, say), both on average time invested and on the margin (unless you're working on this way more than I thought). Do you agree? What are you planning?
Edit: relatedly, I think this is the highlight of the post and the title misses the point.
I still put higher EV on my technical research, because this isn't the only barrier to scaling research. Indeed, I expect technical work itself will be a necessary precondition to scaling research in the not-very-distant future; scaling research ultimately requires a paradigm, and paradigm discovery is a technical problem.
But I do think this is high-value, plans are under way to scale up for the next round of MATS, and I'm also hoping to figure out how to offload most of the work to other people.
Edit: relatedly, I think this is the highlight of the post and the title misses the point.
Yeah, a lot of my posts over the past month or two have been of frankly mediocre quality by my usual standards. I previously had a policy of mostly not talking directly about alignment strategy, critiques of other peoples' research, and other alignment meta stuff. For various reasons I've lately been writing down a bunch of it very quickly, which does trade off to some extent with quality. Hopefully I'll run out of such material soon and go back to writing better posts about more interesting things.
I noticed that you began this post by saying
Occasionally people say “hey, alignment research has lots of money behind it now, why not fund basically everyone who wants to try it?”.
But then the rest of the post did not directly address this question. You point out that most of “everyone who wants to try it” will start out with some (probably) flawed ideas, but that doesn’t seem like a good argument for not funding them. After all, you yourself experienced the growth you want others to go through while being funded. I’d expect someone who is able to spend more of their time doing research (due to being able to focus on it full-time) will likely reach intellectual maturity on the topic faster than someone who has to focus on making a living in another area as well.
Mostly what I'm arguing for here is a whole different model, where newcomers are funded with a goal of getting them through the Path (probably with resources designed for that purpose), rather than relying on Alignment Maturity coming about accidentally as a side-effect of research.
(Also, minor point, I think I was most of the way through the Path by the time I got my first grant, so I actually did go through that growth before I had funding. But I don't think that's particularly relevant here.)
At the moment there's a plan to create The Berlin Hub as a coliving space for new AI safety researchers. What lessons do you think should be drawn from the thesis you laid out for that project? Do you believe that the peer review that happens in that environment will push people forward on the Path, or do you fear that a lot of people at the Hub would do work that doesn't matter?
This is extremely difficult. There's some good literature on cooperative living worth reading, because there are countless common pitfalls. Also, being a research org at the same time is quite ambitious. Good luck!
Do you happen to have any references, beyond those search terms, for finding literature on cooperative living?
We had some books at a previous coop — it might have been these:
https://www.ic.org/community-bookstore/product/wisdom-of-communities-complete-set/
There are many practicalities: admitting good members, dealing with problematic members, keeping the kitchen sink clean, keeping the floors clean, keeping track of rent, doing repairs, etc. Some of this is alleviated if you have a big budget. Culture is extremely tricky. It is extremely rewarding when it works.
Visiting a coop for even a week reveals quite a bit about how it works — worth doing if you haven't already.
The main immediate advice I'd give is to look at people switching projects/problems/ideas as a key metric. Obviously that's not a super-robust proxy and will break down if people start optimizing for it directly. But insofar as changes in which projects people work on are driven by updates to their underlying models, it's a pretty good metric of progress down the Path.
At this point, I still have a lot of uncertainty about things which will work well or not work well for accelerating people down the Path; it looks tractable, but that doesn't mean that it's clear yet what the best methods are. Trying things and seeing what causes people to update a lot seems like a generally good approach.
To clarify, basically anyone who actually wants to work on alignment full time, and who is at all promising and willing to learn, is already getting funded to upskill and investigate for at least a few months. The question here is "why not fund them to do X, if they suggest it," and my answer is that if the only thing they are interested in is X, and X is one of the things John listed above, they aren't going to get funded unless they have a really great argument. And most don't. Either they take the feedback and come up with another idea, they upskill and learn more as I suggest, or they decide to do something else.
Upvoted for changing my mind on how exactly to do this, and discussing the promising solutions!
(Also, somebody doing "shoulder John" with me at EAG helped turn me away from my bad ideas and toward learning the key problems in more depth. That, plus a random meeting with Yudkowsky. TLDR: my research ideas were actually research topics, not ideas.)
One possible failure mode I can imagine with this approach:
Suppose some important part of the conventional wisdom among you and/or your collaborators is wrong. It is likely that this flaw, if it exists, becomes more difficult for a researcher to discover if they have already heard a plausible-sounding explanation for the discrepancy. So by providing an "accelerated" path for new alignment researchers, you may reduce the likelihood that the error will be discovered.
This risk may be justifiable if enough smart researchers continue to work towards an understanding of alignment without taking this "accelerated track". But my own observations in the context of high-prestige internships suggest that programs like this will be seen as a "fast track" to success in the field, and the most talented students will compete for entry.
It sounds like you are picturing an implementation which is not actually what I'd recommend. I believe this comment already responds to basically the same concern, but let me know if you're saying something not covered by that.
I would love to see you say why you consider these bad ideas. Is the obvious issue that such AIs could be unaligned themselves, or is it more along the lines of these assistants needing a complete model of human values to be truly useful?
John's Why Not Just... sequence is a series of somewhat rough takes on a few of them (though I think many of them are not written up super comprehensively).
This is true in every field, and apparently it's very difficult to systematize. Perhaps it's a highly unstable social state to have people changing directions, or thinking and speaking super honestly, very often.
How could one succeed where so few have?
There does seem to be a sizeable amount of broken-record-type repetition on this topic, though the phenomenon occurs in all advanced fields with public attention.
Epistemic status: lots of highly subjective and tentative personal impressions.
Occasionally people say “hey, alignment research has lots of money behind it now, why not fund basically everyone who wants to try it?”. Often this involves an analogy to venture capital: alignment funding is hits-based (i.e. the best few people are much more productive than everyone else combined), funders aren’t actually that good at distinguishing the future hits, so what we want is a whole bunch of uncorrelated bets.
The main place where this fails, in practice, is the “uncorrelated” part. It turns out that most newcomers to alignment have the same few Bad Ideas.
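As a rough toy illustration of why the "uncorrelated" part matters so much (all the numbers here - 200 funded newcomers, a 2% hit rate per independent line of attack, 5 distinct starting ideas - are made up purely for illustration): if everyone's bets cluster into a handful of shared ideas, funding 200 people buys you roughly 5 effective bets, not 200.

```python
import numpy as np

rng = np.random.default_rng(0)

N_RESEARCHERS = 200   # made-up number of funded newcomers
N_CLUSTERS = 5        # made-up number of distinct "first ideas" they actually hold
P_HIT = 0.02          # made-up chance that any one independent line of attack pans out
N_TRIALS = 50_000

# Ideal case: every researcher is an independent bet.
uncorrelated = rng.random((N_TRIALS, N_RESEARCHERS)) < P_HIT
print("P(at least one hit), 200 uncorrelated bets:", uncorrelated.any(axis=1).mean())  # ~0.98

# Correlated case: researchers cluster into a few shared ideas, and an idea
# either pans out or it doesn't for everyone pursuing it.
correlated = rng.random((N_TRIALS, N_CLUSTERS)) < P_HIT
print("P(at least one hit), 5 effective bets:     ", correlated.any(axis=1).mean())    # ~0.10
```

(Obviously real research lines aren't all-or-nothing like this; the point of the toy model is just that the effective number of bets scales with the number of distinct ideas, not the number of people funded.)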
The most common, these days, is some variant of “train an AI to help with aligning AI”. Sometimes it’s “train an AI to interpret the internals of another AI”, sometimes it’s “train an AI to point out problems in another AI’s plan”, sometimes it’s “train an AI to help you design aligned AI”, etc. I would guess about 75% of newcomers from ML suggest some such variant as their first idea.
People who are less aware of standard alignment arguments tend to start with “train an AI on human feedback” or “iterate until the problems go away”. In the old days, pre-sequences, people started from even worse ideas; at least the waterline has risen somewhat.
People with more of a theory bent or an old-school AI background tend to reinvent IRL or CIRL variants. (A CIRL variant was my own starting Bad Idea - you can read about it in this post from 2020, although the notes from which that post was written were from about 2016-2017.)
My impression (based on very limited data) is that it takes most newcomers ~5 years to go from their initial Bad Idea to actually working on something plausibly useful. For lack of a better name, let’s call that process the Path of Alignment Maturity.
My impression is that progress along the Path of Alignment Maturity can be accelerated dramatically by actively looking for problems with your own plans - e.g. the builder/breaker framework from the Eliciting Latent Knowledge doc, or some version of the Alignment Game Tree exercise, or having a group of people who argue and poke holes in each others’ plans. (Of course these all first require not being too emotionally attached to your own plan; it helps a lot if you can come up with a second or third line of attack, thereby building confidence that there’s something else to move on to.) It can also be accelerated by starting with some background knowledge of difficult problems adjacent to alignment/agency - I notice philosophers tend to make unusually fast progress down the Path that way, and I think prior experience with adjacent problems also cut about 3-4 years off the Path for me. (To be clear, I don’t necessarily recommend that as a strategy for a newcomer - I spent ~5 years working on agency-adjacent problems before working on alignment, and that only cut ~3-4 years off my Path of Alignment Maturity. That wasn’t the only alignment-related value I gained from my background knowledge, but the faster progress down the Path was not worthwhile on its own.) General background experience/knowledge about the world also helps a lot - e.g. I expect someone who's founded and worked at a few startups will make faster progress than someone who’s only worked at one big company, and either of those will make faster progress than someone who’s never been outside of academia.
On the flip side, I expect that progress down the Path of Alignment Maturity is slower for people who spend their time heads-down in the technical details of a particular approach, and spend less time reflecting on whether it’s the right approach at all or arguing with people who have very different models. I’d guess that this is especially a problem for people at orgs with alignment work focused on specific agendas - e.g. I’d guess progress down the Path is slower at Redwood or OpenAI, but faster at Conjecture or Deepmind (because those orgs have a relatively high variety of alignment models internally, as I understand it).
I think accelerating newcomers’ progress down the Path of Alignment Maturity is one of the most tractable places where community builders and training programs can add a lot of value. I’ve been training about a dozen people through the MATS program this summer, and I currently think accelerating participants’ progress down the Path has been the biggest success. We had a lot of content aimed at that: the Alignment Game Tree, two days of the “train a shoulder John” exercise plus a third day of the same exercise with Eliezer, the less formal process of people organized into teams kicking ideas around and arguing with each other, and of course general encouragement to pivot to new problems and strategies (which most people did multiple times). Overall, my very tentative and subjective impression is that the program shaved ~3 years off the median participant’s Path of Alignment Maturity; they seem-to-me to be coming up with project ideas about on par with a typical person 3 years further in. The shoulder John/Eliezer exercises were relatively costly and I don’t think most groups should try to duplicate them, but other than those I expect most of the MATS content can scale quite well, so in principle it should be possible to do this with a lot more people.