A policeman sees a drunk man searching for something under a streetlight and asks what the drunk has lost. He says he lost his keys and they both look under the streetlight together. After a few minutes the policeman asks if he is sure he lost them here, and the drunk replies, no, and that he lost them in the park. The policeman asks why he is searching here, and the drunk replies, "this is where the light is".

Over the past few years, a major source of my relative optimism on AI has been the hope that the field of alignment would transition from pre-paradigmatic to paradigmatic, and make much more rapid progress.

At this point, that hope is basically dead. There has been some degree of paradigm formation, but the memetic competition has mostly been won by streetlighting: the large majority of AI Safety researchers and activists are focused on searching for their metaphorical keys under the streetlight. The memetically-successful strategy in the field is to tackle problems which are easy, rather than problems which are plausible bottlenecks to humanity’s survival. That pattern of memetic fitness looks likely to continue to dominate the field going forward.

This post is on my best models of how we got here, and what to do next.

What This Post Is And Isn't, And An Apology

This post starts from the observation that streetlighting has mostly won the memetic competition for alignment as a research field, and we'll mostly take that claim as given. Lots of people will disagree with that claim, and convincing them is not a goal of this post. In particular, probably the large majority of people in the field have some story about how their work is not searching under the metaphorical streetlight, or some reason why searching under the streetlight is in fact the right thing for them to do, or [...].

The kind and prosocial version of this post would first walk through every single one of those stories and argue against them at the object level, to establish that alignment researchers are in fact mostly streetlighting (and review how and why streetlighting is bad). Unfortunately that post would be hundreds of pages long, and nobody is ever going to get around to writing it. So instead, I'll link to:

(Also I might link some more in the comments section.) Please go have the object-level arguments there rather than rehashing everything here.

Next comes the really brutally unkind part: the subject of this post necessarily involves modeling what's going on in researchers' heads, such that they end up streetlighting. That means I'm going to have to speculate about how lots of researchers are being stupid internally, when those researchers themselves would probably say that they are not being stupid at all and I'm being totally unfair. And then when they try to defend themselves in the comments below, I'm going to say "please go have the object-level argument on the posts linked above, rather than rehashing hundreds of different arguments here". To all those researchers: yup, from your perspective I am in fact being very unfair, and I'm sorry. You are not the intended audience of this post, I am basically treating you like a child and saying "quiet please, the grownups are talking", but the grownups in question are talking about you and in fact I'm trash talking your research pretty badly, and that is not fair to you at all.

But it is important, and this post just isn't going to get done any other way. Again, I'm sorry.

Why The Streetlighting?

A Selection Model

First and largest piece of the puzzle: selection effects favor people doing easy things, regardless of whether the easy things are in fact the right things to focus on. (Note that, under this model, it's totally possible that the easy things are the right things to focus on!)

What does that look like in practice? Imagine two new alignment researchers, Alice and Bob, fresh out of a CS program at a mid-tier university. Both go into MATS or AI Safety Camp or get a short grant or [...]. Alice is excited about the eliciting latent knowledge (ELK) doc, and spends a few months working on it. Bob is excited about debate, and spends a few months working on it. At the end of those few months, Alice has a much better understanding of how and why ELK is hard, has correctly realized that she has no traction on it at all, and pivots to working on technical governance. Bob, meanwhile, has some toy but tangible outputs, and feels like he's making progress.

... of course (I would say) Bob has not made any progress toward solving any probable bottleneck problem of AI alignment, but he has tangible outputs and is making progress on something, so he'll probably keep going.

And that's what the selection pressure model looks like in practice. Alice is working on something hard, correctly realizes that she has no traction, and stops. (Or maybe she just keeps spinning her wheels until she burns out, or funders correctly see that she has no outputs and stop funding her.) Bob is working on something easy, he has tangible outputs and feels like he's making progress, so he keeps going and funders keep funding him. How much impact Bob's work has on humanity's survival is very hard to measure, but the fact that he's making progress on something is easy to measure, and the selection pressure rewards that easy metric.

Generalize this story across a whole field, and we end up with most of the field focused on things which are easy, regardless of whether those things are valuable.
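To make the selection story concrete, here is a minimal toy simulation. It is a sketch under made-up assumptions, not a calibrated model: the function name `simulate_field`, the starting mix, and the per-round probabilities of producing a tangible output are all hypothetical numbers chosen for illustration. Researchers who produce no tangible output in a round drop out (they pivot, burn out, or lose funding). Note that the value of a problem never enters the simulation; only legible output does.

```python
import random


def simulate_field(n_researchers=1000, rounds=5, p_hard=0.5,
                   p_output_easy=0.9, p_output_hard=0.1, seed=0):
    """Toy selection model (all parameters are illustrative assumptions).

    Each researcher works on a "hard" or an "easy" problem. Every round,
    a researcher stays in the field only if they produced a tangible output.
    """
    rng = random.Random(seed)
    # Initial mix: each newcomer independently picks a hard or an easy problem.
    field = ["hard" if rng.random() < p_hard else "easy"
             for _ in range(n_researchers)]
    for _ in range(rounds):
        survivors = []
        for problem in field:
            p_output = p_output_hard if problem == "hard" else p_output_easy
            # Selection step: no tangible output means leaving the field.
            if rng.random() < p_output:
                survivors.append(problem)
        field = survivors
    return field


field = simulate_field()
share_easy = field.count("easy") / len(field) if field else float("nan")
print(f"{len(field)} researchers remain; {share_easy:.0%} work on easy problems")
```

Under these assumed numbers, a field that starts out roughly half hard and half easy ends up almost entirely on easy problems after a handful of rounds, whether or not the easy problems matter.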

Selection and the Labs

Here's a special case of the selection model which I think is worth highlighting.

Let's start with a hypothetical CEO of a hypothetical AI lab, who (for no particular reason) we'll call Sam. Sam wants to win the race to AGI, but also needs an AI Safety Strategy. Maybe he needs the safety strategy as a political fig leaf, or maybe he's honestly concerned but not very good at not-rationalizing. Either way, he meets with two prominent AI safety thinkers - let's call them (again for no particular reason) Eliezer and Paul. Both are clearly pretty smart, but they have very different models of AI and its risks. It turns out that Eliezer's model predicts that alignment is very difficult and totally incompatible with racing to AGI. Paul's model... if you squint just right, you could maybe argue that racing toward AGI is sometimes a good thing under Paul's model? Lo and behold, Sam endorses Paul's model as the Official Company AI Safety Model of his AI lab, and continues racing toward AGI. (Actually the version which eventually percolates through Sam's lab is not even Paul's actual model, it's a quite different version which just-so-happens to be even friendlier to racing toward AGI.)

A "Flinching Away" Model

While selection for researchers working on easy problems is one big central piece, I don't think it fully explains how the field ends up focused on easy things in practice. Even looking at individual newcomers to the field, there's usually a tendency to gravitate toward easy things and away from hard things. What does that look like?

Carol follows a similar path to Alice: she's interested in the Eliciting Latent Knowledge problem, and starts to dig into it, but hasn't really understood it much yet. At some point, she notices a deep difficulty introduced by sensor tampering - in extreme cases it makes problems undetectable, which breaks the iterative problem-solving loop, breaks ease of validation, destroys potential training signals, etc. And then she briefly wonders if the problem could somehow be tackled without relying on accurate feedback from the sensors at all. At that point, I would say that Carol is thinking about the real core ELK problem for the first time.

... and Carol's thoughts run into a blank wall. In the first few seconds, she sees no toeholds, not even a starting point. And so she reflexively flinches away from that problem, and turns back to some easier problems. At that point, I would say that Carol is streetlighting.

It's the reflexive flinch which, on this model, comes first. After that will come rationalizations. Some common variants:

  • Carol explicitly introduces some assumption simplifying the problem, and claims that without the assumption the problem is impossible. (Ray's workshop on one-shotting Baba Is You levels apparently reproduced this phenomenon very reliably.)
  • Carol explicitly says that she's not trying to solve the full problem, but hopefully the easier version will make useful marginal progress.
  • Carol explicitly says that her work on easier problems is only intended to help with near-term AI, and hopefully those AIs will be able to solve the harder problems.
  • (Most common) Carol just doesn't think about the fact that the easier problems don't really get us any closer to aligning superintelligence. Her social circles act like her work is useful somehow, and that's all the encouragement she needs.

... but crucially, the details of the rationalizations aren't that relevant to this post. Someone who's flinching away from a hard problem will always be able to find some rationalization. Argue them out of one (which is itself difficult), and they'll promptly find another. If we want people to not streetlight, then we need to somehow solve the flinching.

Which brings us to the "what to do about it" part of the post.

What To Do About It

Let's say we were starting a new field of alignment from scratch. How could we avoid the streetlighting problem, assuming the models above capture the core gears?

First key thing to notice: in our opening example with Alice and Bob, Alice correctly realized that she had no traction on the problem. If the field is to be useful, then somewhere along the way someone needs to actually have traction on the hard problems.

Second key thing to notice: if someone actually has traction on the hard problems, then the "flinching away" failure mode is probably circumvented.

So one obvious thing to focus on is getting traction on the problems.

... and in my experience, there are people who can get traction on the core hard problems. Most notably physicists - when they grok the hard parts, they tend to immediately see footholds, rather than a blank impassable wall. I'm picturing here e.g. the sort of crowd at the ILLIAD conference; these were people who mostly did not seem at risk of flinching away, because they saw routes to tackle the problems. (To be clear: though ILLIAD was a theory conference, I do not mean to imply that it's only theorists who ever have any traction.) And they weren't being selected away, because many of them were in fact doing work and making progress.

Ok, so if there are a decent number of people who can get traction, why do the large majority of the people I talk to seem to be flinching away from the hard parts? 

How We Got Here

The main problem, according to me, is the EA recruiting pipeline.

On my understanding, EA student clubs at colleges/universities have been the main “top of funnel” for pulling people into alignment work during the past few years. The mix of people going into those clubs is disproportionately STEM-focused undergrads, and looks pretty typical for STEM-focused undergrads. We’re talking about pretty standard STEM majors from pretty standard schools, neither the very high end nor the very low end of the skill spectrum.

... and that's just not a high enough skill level for people to look at the core hard problems of alignment and see footholds.

Who To Recruit Instead

We do not need pretty standard STEM-focused undergrads from pretty standard schools. In practice, the level of smarts and technical knowledge needed to gain any traction on the core hard problems seems to be roughly "physics postdoc". Obviously that doesn't mean we exclusively want physics postdocs - I personally have only an undergrad degree, though amusingly a list of stuff I studied has been called "uncannily similar to a recommendations to readers to roll up their own doctorate program". Point is, it's the rough level of smarts and skills which matters, not the sheepskin. (And no, a doctorate degree in almost any other technical field, including ML these days, does not convey a comparable level of general technical skill to a physics PhD.)

As an alternative to recruiting people who have the skills already, one could instead try to train people. I've tried that to some extent, and at this point I think there just isn't a substitute for years of technical study. People need that background knowledge in order to see footholds on the core hard problems.

Integration vs Separation

Last big piece: if one were to recruit a bunch of physicists to work on alignment, I think it would be useful for them to form a community mostly-separate from the current field. They need a memetic environment which will amplify progress on core hard problems, rather than... well, all the stuff that's currently amplified.

This is a problem which might solve itself, if a bunch of physicists move into alignment work. Heck, we've already seen it to a very limited extent with the ILLIAD conference itself. Turns out people working on the core problems want to talk to other people working on the core problems. But the process could perhaps be accelerated a lot with more dedicated venues.

Comments

My guess is a roughly equally "central" problem is the incentive landscape around the OpenPhil/Anthropic school of thought 

  • where you see Sam, I suspect something like "the lab memeplexes". Lab superagents have instrumental convergent goals, and the instrumental convergent goals lead to instrumental, convergent beliefs, and also to instrumental blindspots
  • there are strong incentives for individual people to adjust their beliefs: money, social status, sense of importance via being close to the Ring
  • there are also incentives for people setting some of the incentives: funding something making progress on something seems more successful and easier than funding the dreaded theory 
aysja

I’m not convinced that the “hard parts” of alignment are difficult in the standardly difficult, g-requiring way that e.g., a physics post-doc might possess. I do think it takes an unusual skillset, though, which is where most of the trouble lives. I.e., I think the pre-paradigmatic skillset requires unusually strong epistemics (because you often need to track for yourself what makes sense), ~creativity (the ability to synthesize new concepts, to generate genuinely novel hypotheses/ideas), good ability to traverse levels of abstraction (connecting details to large level structure, this is especially important for the alignment problem), not being efficient market pilled (you have to believe that more is possible in order to aim for it), noticing confusion, and probably a lot more that I’m failing to name here.

Most importantly, though, I think it requires quite a lot of willingness to remain confused. Many scientists who accomplished great things (Darwin, Einstein) didn’t have publishable results on their main inquiry for years. Einstein, for instance, talks about wandering off for weeks in a state of “psychic tension” in his youth, it took ~ten years to go from his first inkling of relativity to special relativity, and he nearly gave up at many points (including the week before he figured it out). Figuring out knowledge at the edge of human understanding can just be… really fucking brutal. I feel like this is largely forgotten, or ignored, or just not understood. Partially that's because in retrospect everything looks obvious, so it doesn’t seem like it could have been that hard, but partially it's because almost no one tries to do this sort of work, so there aren't societal structures erected around it, and hence little collective understanding of what it's like.

Anyway, I suspect there are really strong selection pressures for who ends up doing this sort of thing, since a lot needs to go right: smart enough, creative enough, strong epistemics, independent, willing to spend years without legible output, exceptionally driven, and so on. Indeed, the last point seems important to me—many great scientists are obsessed. Spend night and day on it, it’s in their shower thoughts, can’t put it down kind of obsessed. And I suspect this sort of has to be true because something has to motivate them to go against every conceivable pressure (social, financial, psychological) and pursue the strange meaning anyway.

I don’t think the EA pipeline is much selecting for pre-paradigmatic scientists, but I don’t think lack of trying to get physicists to work on alignment is really the bottleneck either. Mostly I think selection effects are very strong, e.g., the Sequences was, imo, one of the more effective recruiting strategies for alignment. I don’t really know what to recommend here, but I think I would anti-recommend putting all the physics post-docs from good universities in a room in the hope that they make progress. Requesting that the world write another book as good as the Sequences is a... big ask, although to the extent it’s possible I expect it’ll go much further in drawing people out who will self select into this rather unusual "job."

Epistemic status: This is a work of satire. I mean it: it is a mean-spirited and unfair assessment of the situation. It is also how, some days, I sincerely feel.

A minivan is driving down a mountain road, headed towards a cliff's edge with no guardrails. The driver floors the accelerator.

Passenger 1: "Perhaps we should slow down somewhat."

Passengers 2, 3, 4: "Yeah, that seems sensible."

Driver: "No can do. We're about to be late to the wedding."

Passenger 2: "Since the driver won't slow down, I should work on building rocket boosters so that (when we inevitably go flying off the cliff edge) the van can fly us to the wedding instead."

Passenger 3: "That seems expensive."

Passenger 2: "No worries, I've hooked up some funding from Acceleration Capital. With a few hours of tinkering we should get it done."

Passenger 1: "Hey, doesn't Acceleration Capital just want vehicles to accelerate, without regard to safety?"

Passenger 2: "Sure, but we'll steer the funding such that the money goes to building safe and controllable rocket boosters."

The van doesn't slow down. The cliff looks closer now.

Passenger 3: [looking at what Passenger 2 is building] "Uh, haven't you just made a faster engine?"

Passenger 2: "Don't worry, the engine is part of the fundamental technical knowledge we'll need to build the rockets. Also, the grant I got was for building motors, so we kinda have to build one."

Driver: "Awesome, we're gonna get to the wedding even sooner!" [Grabs the engine and installs it. The van speeds up.]

Passenger 1: "We're even less safe now!"

Passenger 3: "I'm going to start thinking about ways to manipulate the laws of physics such that (when we inevitably go flying off the cliff edge) I can manage to land us safely in the ocean."

Passenger 4: "That seems theoretical and intractable. I'm going to study the engine to figure out just how it's accelerating at such a frightening rate. If we understand the inner workings of the engine, we should be able to build a better engine that is more responsive to steering, therefore saving us from the cliff."

Passenger 1: "Uh, good luck with that, I guess?"

Nothing changes. The cliff is looming.

Passenger 1: "We're gonna die if we don't stop accelerating!"

Passenger 2: "I'm gonna finish the rockets after a few more iterations of making engines. Promise."

Passenger 3: "I think I have a general theory of relativity as it relates to the van worked out..."

Passenger 4: "If we adjust the gear ratio... Maybe add a smart accelerometer?"

Driver: "Look, we can discuss the benefits and detriments of acceleration over hors d'oeuvres at the wedding, okay?"

Driver: My map doesn't show any cliffs

Passenger 1: Have you turned on the terrain map? Mine shows a sharp turn next to a steep drop coming up in about a mile

Passenger 5: Guys maybe we should look out the windshield instead of down at our maps?

Driver: No, passenger 1, see on your map that's an alternate route, the route we're on doesn't show any cliffs.

Passenger 1: You don't have it set to show terrain.

Passenger 6: I'm on the phone with the governor now, we're talking about what it would take to set a 5 mile per hour national speed limit.

Passenger 7: Don't you live in a different state?

Passenger 5: The road seems to be going up into the mountains, though all the curves I can see from here are gentle and smooth.

Driver and all passengers in unison: Shut up passenger 5, we're trying to figure out if we're going to fall off a cliff here, and if so what we should do about it.

Passenger 7: Anyway, I think what we really need to do to ensure our safety is to outlaw automobiles entirely.

Passenger 3: The highest point on Earth is 8849m above sea level, and the lowest point is 430 meters below sea level, so the cliff in front of us could be as high as 9279m.

Unfortunately, the disanalogy is that any driver who moves their foot towards the brakes is almost instantly replaced with one who won't.

(Prefatory disclaimer that, admittedly as an outsider to this field, I absolutely disagree with the labeling of prosaic AI work as useless streetlighting, for reasons building upon what many commenters wrote in response to the very posts you linked here as assumed background material. But in the spirit of your post, I shall ignore that moving forward.)

The "What to Do About It" section dances around but doesn't explicitly name one of the core challenges of theoretical agent-foundations work that aims to solve the "hard bits" of the alignment challenge, namely the seeming lack of reliable feedback loops that give you some indication that you are pushing towards something practically useful in the end instead of just a bunch of cool math that nonetheless resides alone in its separate magisterium. As Connor Leahy concisely put it:

Humans are really, really bad at doing long chains of abstract reasoning without regular contact with reality, so in practice imo good philosophy has to have feedback loops with reality, otherwise you will get confused.

He was talking about philosophy in particular at that juncture, in response to Wei Dai's concerns over metaphilosophical competence, but this point seems to me to generalize to a whole bunch of other areas as well. Indeed, I have talked about this before.

... and in my experience, there are people who can get traction on the core hard problems. Most notably physicists - when they grok the hard parts, they tend to immediately see footholds, rather than a blank impassable wall.

Do they get traction on "core hard problems" because of how Inherently Awesome they are as researchers, or do they do so because the types of physics problems we mostly care about currently are such that, while the generation of (worthwhile) grand mathematical theories is hard, verifying them is (comparatively) easier because we can run a bunch of experiments (or observe astronomical data etc., in the super-macro scale) to see if the answers they spit out comply with reality? I am aware of your general perspective on this matter, but I just... still completely disagree, for reasons other people have pointed out (see also Vanessa Kosoy's comment here). Is this also supposed to be an implicitly assumed bit of background material?

And when we don't have those verifying experiments at hand, do we not get stuff like string theory, where the math is beautiful and exquisite (in the domains it has been extended to) but debate by "physics postdocs" over whether it's worthwhile to keep funding and pursuing it keeps raging on as a Theory of Everything keeps eluding our grasp? I'm sure people with more object-level expertise on this can correct my potential misconceptions if need be.

Idk man, some days I'm half-tempted to believe that all non-prosaic alignment work is a bunch of "streetlighting." Yeah, it doesn't result in the kind of flashy papers full of concrete examples about current models that typically get associated with the term-in-scare-quotes. But it sure seems to cover itself in a veneer of respectability by giving a (to me) entirely unjustified appearance of rigor and mathematical precision and robustness to claims about what will happen in the real world based on nothing more than a bunch of vibing about toy models that assume away the burdensome real-world details serving as evidence whether the approaches are even on the right track. A bunch of models that seem both woefully underpowered for the Wicked Problems they must solve and also destined to underfit their target, for they (currently) all exist and supposedly apply independently of the particular architecture, algorithms, training data, scaffolding etc., that will result in the first patch of really powerful AIs. The contents and success stories of Vanessa Kosoy's desiderata, or of your own search for natural abstractions, or of Alex Altair's essence of agent foundations, or of Orthogonal's QACI, etc., seem entirely insensitive to the fact that we are currently dealing with multimodal LLMs combined with RL instead of some other paradigm, which in my mind almost surely disqualifies them as useful-in-the-real-world when the endgame hits.

There's a famous Eliezer quote about how for every correct answer to a precisely-stated problem, there are a million times more wrong answers one could have given instead. I would build on that to say that for every powerfully predictive, but lossy and reductive mathematical model of a complex real-world system, there are a million times more similar-looking mathematical models that fail to capture the essence of the problem and ultimately don't generalize well at all. And it's only by grounding yourself to reality and hugging the query tight by engaging with real-world empirics that you can figure out if the approach you've chosen is in the former category as opposed to the latter.

(I'm briefly noting that I don't fully endorse everything I said in the previous 2 paragraphs, and I realize that my framing is at least a bit confrontational and unfair. Separately, I acknowledge the existence of arguably-non-prosaic and mostly theoretical alignment approaches like davidad's Open Agency Architecture, CHAI's CIRL and utility uncertainty, Steve Byrnes's work on brain-like AGI safety, etc., that don't necessarily appear to fit this mold. I have varying opinions on the usefulness and viability of such approaches.)

I actually disagree with the claim that the natural abstractions research is ungrounded. Indeed, I think there is reason to believe that at least some of the natural abstractions work, especially the natural abstraction hypothesis, actually sort of holds true for today's AI, and thus is the most likely of the theoretical/agent-foundations approaches to work (I'm usually critical of agent foundations, but John Wentworth's work is an exception that I'd like funding for).

For example, this post does an experiment that shows that OOD data still makes the Platonic Representation Hypothesis true, meaning that it's likely that deeper factors are at play than just shallow similarity:

https://www.lesswrong.com/posts/Su2pg7iwBM55yjQdt/exploring-the-platonic-representation-hypothesis-beyond-in

I'm wary of a possible equivocation about what the "natural abstraction hypothesis" means here.

If we are referring to the redundant information hypothesis and various kinds of selection theorems, this is a mathematical framework that could end up being correct, is not at all ungrounded, and Wentworth sure seems like the man for the job. 

But then you are still left with the task of grounding this framework in physical reality to allow you to make correct empirical predictions about and real-world interventions on what you will see from more advanced models. Our physical world abstracting well seems plausible (not necessarily >50% likely), and these abstractions being "natural" (e.g., in a category-theoretic sense) seems likely conditional on the first clause of this sentence being true, but I give an extremely low probability to the idea that these abstractions will be used by any given general intelligence or (more to the point) advanced AI model to a large and wide enough extent that retargeting the search is even close to possible.

And indeed, it is the latter question that represents the make-or-break moment for natural abstractions' theory of change, for it is only when the model in front of you (as opposed to some other idealized model) uses these specific abstractions that you can look through the AI's internal concepts and find your desired alignment target.

Rohin Shah has already explained the basic reasons why I believe the mesa-optimizer-type search probably won't exist/be findable in the inner workings of the models we encounter: "Search is computationally inefficient relative to heuristics, and we'll be selecting really hard on computational efficiency." And indeed, when I look at the only general intelligences I have ever encountered in my entire existence thus far, namely humans, I see mostly just a kludge of impulses and heuristics that depend very strongly (almost entirely) on our specific architectural make-up and the contextual feedback we encounter in our path through life. Change either of those and the end result shifts massively.

And even moving beyond that, is the concept of the number "three" a natural abstraction? Then I see entire collections and societies of (generally intelligent) human beings today who don't adopt it. Are the notions of "pressure" and "temperature" and "entropy" natural abstractions? I look at all human beings in 1600 and note that not a single one of them had ever correctly conceptualized a formal version of any of those; and indeed, even making a conservative estimate of the human species (with an essentially unchanged modern cognitive architecture) having existed for 200k years, this means that for 99.8% of our species' history, we had no understanding whatsoever of concepts as "universal" and "natural" as that. If you look at subatomic particles like electrons or stuff in quantum mechanics, the percentage manages to get even higher. And that's only conditioning on abstractions about the outside world that we have eventually managed to figure out; what about the other unknown unknowns?

For example, this post does an experiment that shows that OOD data still makes the Platonic Representation Hypothesis true, meaning that it's likely that deeper factors are at play than just shallow similarity

I don't think it shows that at all, since I have not been able to find any analysis of the methodology, data generation, discussion of results, etc. With no disrespect to the author (who surely wasn't intending for his post to be taken as authoritative as a full paper in terms of updating towards his claim), this is shoddy science, or rather not science at all, just a context-free correlation matrix.

Anyway, all this is probably more fit for a longer discussion at some point.

My view of the development of the field of AI alignment is pretty much the exact opposite of yours: theoretical agent foundations research, what you describe as research on the hard parts of the alignment problem, is a castle in the clouds. Only when alignment researchers started experimenting with real-world machine learning models did AI alignment become grounded in reality. The biggest epistemic failure in the history of the AI alignment community was waiting too long to make this transition.

Early arguments for the possibility of AI existential risk (as seen, for example, in the Sequences) were largely based on 1) rough analogies, especially to evolution, and 2) simplifying assumptions about the structure and properties of AGI. For example, agent foundations research sometimes assumes that AGI has infinite compute or that it has a strict boundary between its internal decision processes and the outside world.

 

As neural networks started to see increasing success at a wide variety of problems in the mid-2010s, it started to become apparent that the analogies and assumptions behind early AI x-risk cases didn't apply to them. The process of developing an ML model isn't very similar to evolution. Neural networks use finite amounts of compute, have internals that can be probed and manipulated, and behave in ways that can't be rounded off to decision theory. On top of that, it became increasingly clear as the deep learning revolution progressed that even if agent foundations research did deliver accurate theoretical results, there was no way to put them into practice.

 

But many AI alignment researchers stuck with the agent foundations approach for a long time after their predictions about the structure and behavior of AI failed to come true. Indeed, the late-2000s AI x-risk arguments still get repeated sometimes, like in List of Lethalities. It's telling that the OP uses worst-case ELK as an example of one of the hard parts of the alignment problem; the framing of the worst-case ELK problem doesn't make any attempt to ground the problem in the properties of any AI system that could plausibly exist in the real world, and instead explicitly rejects any such grounding as not being truly worst-case.

 

Why have ungrounded agent foundations assumptions stuck around for so long? There are a couple factors that are likely at work:

  • Agent foundations nerd-snipes people. Theoretical agent foundations is fun to speculate about, especially for newcomers or casual followers of the field, in a way that experimental AI alignment isn't. There's much more drudgery involved in running an experiment. This is why I, personally, took longer than I should have to abandon the agent foundations approach.
  • Game-theoretic arguments are what motivated many researchers to take the AI alignment problem seriously in the first place. The sunk cost fallacy then comes into play: if you stop believing that game-theoretic arguments for AI x-risk are accurate, you might conclude that all the time you spent researching AI alignment was wasted. 

 

Rather than being an instance of the streetlight effect, the shift to experimental research on AI alignment was an appropriate response to developments in the field of AI as it left the GOFAI era. AI alignment research is now much more grounded in the real world than it was in the early 2010s.

You do realize that by "alignment", the OP (John) is not talking about techniques that prevent an AI that is less generally capable than a capable person from insulting the user or expressing racist sentiments?

We seek a methodology for constructing an AI that either ensures that the AI turns out not to be able to easily outsmart us or (if it does turn out to be able to easily outsmart us) ensures (or makes it likely) that it won't kill us all or do some other terrible thing. (The former is not researched much compared to the latter, but I felt the need to include it for completeness.)

The way it is now, it is not even clear whether you and the OP (John) are talking about the same thing (because "alignment" has come to have a broad meaning).

If you want to continue the conversation, it would help to know whether you see a pressing need for a methodology of the type I describe above. (Many AI researchers do not: they think that outcomes like human extinction are quite unlikely or at least easy to avoid.)

Thank you for writing this post. I'm probably slightly more optimistic than you on some of the streetlighting approaches, but I've also been extremely frustrated that we don't have anything better, when we could.

That means I'm going to have to speculate about how lots of researchers are being stupid internally, when those researchers themselves would probably say that they are not being stupid at all and I'm being totally unfair.

I've seen discussions from people with whom I vehemently disagreed that did similar things, and felt very frustrated by not being able to defend my views with greater bandwidth. This isn't a criticism of this post - I think a non-zero number of those are plausibly good - but: I'd be happy to talk at length with anyone who feels like this post is unfair to them, about our respective views. I likely can't do as good a job as John can (not least because our models aren't identical), but I probably have more energy for talking to alignment researchers[1].

On my understanding, EA student clubs at colleges/universities have been the main “top of funnel” for pulling people into alignment work during the past few years. The mix of people going into those clubs is disproportionately STEM-focused undergrads, and looks pretty typical for STEM-focused undergrads. We’re talking about pretty standard STEM majors from pretty standard schools, neither the very high end nor the very low end of the skill spectrum.

... and that's just not a high enough skill level for people to look at the core hard problems of alignment and see footholds.

Who To Recruit Instead

We do not need pretty standard STEM-focused undergrads from pretty standard schools. In practice, the level of smarts and technical knowledge needed to gain any traction on the core hard problems seems to be roughly "physics postdoc". Obviously that doesn't mean we exclusively want physics postdocs - I personally have only an undergrad degree, though amusingly a list of stuff I studied has been called "uncannily similar to a recommendation to readers to roll up their own doctorate program". Point is, it's the rough level of smarts and skills which matters, not the sheepskin. (And no, a doctorate degree in almost any other technical field, including ML these days, does not convey a comparable level of general technical skill to a physics PhD.)

I disagree on two counts. I think people simply not thinking about what it would take to make superintelligent AI go well is a much, much bigger and more common cause for failure than the others, including flinching away. Getting traction on hard problems would solve the problem only if there weren't even easier-traction (or more interesting) problems that don't help. Very anecdotally, I've talked to some extremely smart people who I would guess are very good at making progress on hard problems, but just didn't think too hard about what solutions help.

I think the skills to do that may be correlated with physics PhDs, but more weakly. I don't think recruiting smart undergrads was a big mistake for that reason. Then again, I only have weak guesses as to what things you should actually select for such that you get people with these skills - there are still definitely failure modes like people who find the hard problems and aren't very good at getting traction on them (or people who overshoot on finding the hard problem and work on something nebulously hard).

My guess would be that a larger source of "what went wrong" follows from incentives like "labs doing very prosaic alignment / interpretability / engineering-heavy work" -> "selecting for people who are very good engineers or the like" -> "selects for people who can make immediate progress on hand-made problems without having to spend a lot of time thinking about what broad directions to work on or where locally interesting research problems are not-great for superintelligent AI".

  1. ^

    In the past I've done this much more adversarially than I'd have liked, so if you're someone who was annoyed at having such a conversation with me before - I promise I'm trying to be better about that.

Robin Hanson recently wrote about two dynamics that can emerge among individuals within an organization when working as a group to reach decisions. These are the "outcome game" and the "consensus game."

In the outcome game, individuals aim to be seen as advocating for decisions that are later proven correct. In contrast, the consensus game focuses on advocating for decisions that are most immediately popular within the organization. When most participants play the consensus game, the quality of decision-making suffers.

The incentive structure within an organization influences which game people play. When feedback on decisions is immediate and substantial, individuals are more likely to engage in the outcome game. Hanson argues that capitalism's key strength is its ability to make outcome games more relevant.

However, if an organization is insulated from the consequences of its decisions or feedback is delayed, playing the consensus game becomes the best strategy for gaining resources and influence. 

This dynamic is particularly relevant in the field of (existential) AI Safety, which needs to develop strategies to mitigate risks before AGI is developed. Currently, we have zero concrete feedback about which strategies can effectively align complex systems whose intelligence is equal to or greater than that of humans.

As a result, it is unsurprising that most alignment efforts avoid tackling seemingly intractable problems. The incentive structures in the field encourage individuals to play the consensus game instead. 

Tahp

I am a physics PhD student. I study field theory. I have a list of projects I've thrown myself at with inadequate technical background (to start with) and figured out. I've convinced a bunch of people at a research institute that they should keep giving me money to solve physics problems. I've been following LessWrong with interest for years. I think that AI is going to kill us all, and would prefer to live for longer if I can pull it off. So what do I do to see if I have anything to contribute to alignment research? Maybe I'm flattering myself here, but I sound like I might be a person of interest for people who care about the pipeline. I don't feel like a great candidate because I don't have any concrete ideas for AI research topics to chase down, but it sure seems like I might start having ideas if I worked on the problem with somebody for a bit. I'm apparently very ok with being an underpaid gopher to someone with grand theoretical ambitions while I learn the material necessary to come up with my own ideas. My only lead to go on is "go look for something interesting in MATS and apply to it" but that sounds like a great way to end up doing streetlight research because I don't understand the field. Ideally, I guess I would have whatever spark makes people dive into technical research in a pretty low-status field for no money for long enough to produce good enough research which convinces people to pay their rent while they keep doing more, but apparently the field can't find enough of those people for it to be unwilling to look for other options.

I know what to do to keep doing physics research. My TA assignment effectively means that I have a part-time job teaching teenagers how to use Newton's laws so I can spend twenty or thirty hours a week coding up quark models. I did well on a bunch of exams to convince an institution that I am capable of the technical work required to do research (and, to be fair, I provide them with 15 hours per week of below-market-rate intellectual labor which they can leverage into tuition that more than pays my salary), so now I have a lot of flexibility to just drift around learning about physics I find interesting while they pay my rent. If someone else is willing to throw 30,000 dollars per year at me to think deeply about AI and get nowhere instead of thinking deeply about field theory to get nowhere, I am not aware of them. Obviously the incentives are perverse to just go around throwing money at people who might be good at AI research, so I'm not surprised that I've only found one potential money spigot for AI research, but I had so many to choose from for physics.

Buck

Going to MATS is also an opportunity to learn a lot more about the space of AI safety research, e.g. considering the arguments for different research directions and learning about different opportunities to contribute. Even if the "streetlight research" project you do is kind of useless (entirely possible), doing MATS is plausibly a pretty good option.

It sounds like you should apply for the PIBBSS Fellowship! (https://pibbss.ai/fellowship/)

Putting venues aside, I'd like to build software (like AI-aided) to make it easier for the physics post-docs to onboard to the field and focus on the 'core problems' in ways that prevent recoil as much as possible. One worry I have with 'automated alignment'-type things is that it similarly succumbs to the streetlight effect due to models and researchers having biases towards the types of problems you mention. By default, the models will also likely just be much better at prosaic-style safety than they will be at the 'core problems'. I would like to instead design software that makes it easier to direct their cognitive labour towards the core problems.

I have many thoughts/ideas about this, but I was wondering if anything comes to mind for you beyond 'dedicated venues' and maybe writing about it.

This post starts from the observation that streetlighting has mostly won the memetic competition for alignment as a research field, and we'll mostly take that claim as given. Lots of people will disagree with that claim, and convincing them is not a goal of this post.

Yep. This post is not for me but I'll say a thing that annoyed me anyway:

... and Carol's thoughts run into a blank wall. In the first few seconds, she sees no toeholds, not even a starting point. And so she reflexively flinches away from that problem, and turns back to some easier problems.

Does this actually happen? (Even if you want to be maximally cynical, I claim presenting novel important difficulties (e.g. "sensor tampering") or giving novel arguments that problems are difficult is socially rewarded.)

One particular way this issue could be ameliorated is by encouraging people to write up null and negative results. One part of your model here is that a null result doesn't get reported, so other people hear about success stories but never about failures. That creates a selection effect toward working on successful programs, and no one hears about the failed attempts to tackle the problem, which is bad for research culture. Unreported negative results are a universal problem across fields.

I think this lens of incentives and the "flinching away" concept are extremely valuable for understanding the field of alignment (and less importantly, everything else:). 

I believe "flinching away" is the psychological tendency that creates bigger and more obvious-on-inspection "ugh fields". I think this is the same underlying mechanism discussed as valence by Steve Byrnes. Motivated reasoning is the name for the resulting cognitive bias. Motivated reasoning overlaps by experimental definition with confirmation bias, the one bias destroying society in Scott Alexander's terms. After studying cognitive biases through the lens of neuroscience for years, I think motivated reasoning is severely hampering progress in alignment, as it is in every other project. I have written about it a little in what is the most important cognitive bias to understand, but I want to address more thoroughly how it impacts alignment research.

This post makes a great start at addressing how that's happening.

I very much agree with the analysis of incentives given here: they are strongly toward tangible and demonstrable progress in any direction vaguely related to the actual problem at hand.

This is a largely separate topic, but I happen to agree that we probably need more experienced thinkers. I  disagree that physicists are obviously the best sort of experienced thinkers. I have been a physicist (as an undergrad) and I have watched physicists get into other fields. Their contributions are valuable but far from the final word and are far better when they inspire or collaborate with others with real knowledge of the target field.

There is much more to say on incentives and the field as a whole, but the remainder deserves more careful thought and separate posts.

This analysis of biases and "flinching away" could be applied to many other approaches than the prosaic alignment you target here. I think you're correct to notice this about prosaic alignment, but it applies to many agent foundations approaches as well. 

A relentless focus on the problem at hand, including its most difficult aspects, is absolutely crucial. Those difficult aspects include the theoretical concerns you link to up front, which prosaic alignment largely fails to address. But the difficult spots also include the inconvenient fact that the world is rushing toward building LLM-based or at least deep net based AGI very rapidly, and there are no good ideas about how to make them stop while we go look in a distant but more promising spot to find some keys. Most agent foundations work seems to flinch away from this aspect.  Both broad schools largely flinch away from the social, political, and economic aspects of the problem. 

 

We are a lens that can see its flaws, but we need to work to see them clearly. This difficult self-critique of locating our flinches and ugh fields is what we all as individuals, and the field as a collective, need to do to see clearly and speed up progress.

If you wanted to create such a community, you could try spinning up a Discord server?

I'm not saying that this would necessarily be a step in the wrong direction, but I don't think a Discord server is capable of fixing a deeply entrenched cultural problem among safety researchers.
 

If moderating the server takes up a few hours of John's time per week the opportunity cost probably isn't worth it. 

A few thoughts.

  1. Have you checked what happens when you throw physics postdocs at the core issues - do they actually get traction or just stare at the sheer cliff for longer while thinking? Did anything come out of the Iliad meeting half a year later? Is there a reason that more standard STEMs aren't given an intro into some of the routes currently thought possibly workable, so they can feel some traction? I think any of these could be true - that intelligence and skills aren't actually useful right now, that the problem is not tractable, or that better onboarding could let the current talent pool get traction - and either way it might not be very cost effective to get physics postdocs involved.

  2. Humans are generally better at doing things when they have more tools available. While the 'hard bits' might be intractable now, they could well be easier to deal with in a few years after other technical and conceptual advances in AI, and even other fields. (Something something about prompt engineering and Anthropic's mechanistic interpretability from inside the field and practical quantum computing outside).

This would mean squeezing every drop of usefulness out of AI at each level of capability, to improve general understanding and to leverage it into breakthroughs in other fields before capabilities increase further. In fact, it might be best to sabotage semiconductor/chip production once the models are one generation before superintelligence/extinction/whatever, giving maximum time to leverage maximum capabilities and tackle alignment before the AIs get too smart.

  3. How close is mechanistic interpretability to the hard problems, and what makes it not good enough?