Comment Permalink

[anonymous]4mo7013

(Prefatory disclaimer that, admittedly as an outsider to this field, I absolutely disagree with the labeling of prosaic AI work as useless streetlighting, for reasons building upon what many commenters wrote in response to the very posts you linked here as assumed background material. But in the spirit of your post, I shall ignore that moving forward.)

The "What to Do About It" section dances around but doesn't explicitly name one of the core challenges of theoretical agent-foundations work that aims to solve the "hard bits" of the alignment challenge, namely the seeming lack of reliable feedback loops that give you some indication that you are pushing towards something practically useful in the end instead of just a bunch of cool math that nonetheless resides alone in its separate magisterium. As Conor Leahy concisely put it:

Humans are really, really bad at doing long chains of abstract reasoning without regular contact with reality, so in practice imo good philosophy has to have feedback loops with reality, otherwise you will get confused.

He was talking about philosophy in particular at that juncture, in response to Wei Dai's concerns over metaphilosophical competence, but this point seems to me to generalize to a whole bunch of other areas as well. Indeed, I have talked about this before.

... and in my experience, there are people who can get traction on the core hard problems. Most notably physicists - when they grok the hard parts, they tend to immediately see footholds, rather than a blank impassable wall.

Do they get traction on "core hard problems" because of how Inherently Awesome they are as researchers, or do they do so because the types of physics problems we mostly care about currently are such that, while the generation of (worthwhile) grand mathematical theories is hard, verifying them is (comparatively) easier because we can run a bunch of experiments (or observe astronomical data etc., in the super-macro scale) to see if the answers they spit out comply with reality? I am aware of your general perspective on this matter, but I just... still completely disagree, for reasons other people have pointed out (see also Vanessa Kosoy's comment here). Is this also supposed to be an implicitly assumed bit of background material?

And when we don't have those verifying experiments at hand, do we not get stuff like string theory, where the math is beautiful and exquisite (in the domains it has been extended do) but debate by "physics postdocs" over whether it's worthwhile to keep funding and pursuing it keeps raging on as a Theory of Everything keeps eliding our grasp? I'm sure people with more object-level expertise on this can correct my potential misconceptions if need be.

Idk man, some days I'm half-tempted to believe that all non-prosaic alignment work is a bunch of "streetlighting." Yeah, it doesn't result in the kind of flashy papers full of concrete examples about current models that typically get associated with the term-in-scare-quotes. But it sure seems to cover itself in a veneer of respectability by giving a (to me) entirely unjustified appearance of rigor and mathematical precision and robustness to claims about what will happen in the real world based on nothing more than a bunch of vibing about toy models that assume away the burdensome real-world details serving as evidence whether the approaches are even on the right track. A bunch of models that seem both woefully underpowered for the Wicked Problems they must solve and also destined to underfit their target, for they (currently) all exist and supposedly apply independently of the particular architecture, algorithms, training data, scaffolding etc., that will result in the first patch of really powerful AIs. The contents and success stories of Vanessa Kosoy's desiderata, or of your own search for natural abstractions, or of Alex Altair's essence of agent foundations, or of Orthogonal's QACI, etc., seem entirely insensitive to the fact that we are currently dealing with multimodal LLMs combined with RL instead of some other paradigm, which in my mind almost surely disqualifies them as useful-in-the-real-world when the endgame hits.

There's a famous Eliezer quote about how for every correct answer to a precisely-stated problem, there are a million times more wrong answers one could have given instead. I would build on that to say that for every powerfully predictive, but lossy and reductive mathematical model of a complex real-world system, there are a million times more similar-looking mathematical models that fail to capture the essence of the problem and ultimately don't generalize well at all. And it's only by grounding yourself to reality and hugging the query tight by engaging with real-world empirics that you can figure out if the approach you've chosen is in the former category as opposed to the latter.

(I'm briefly noting that I don't fully endorse everything I said in the previous 2 paragraphs, and I realize that my framing is at least a bit confrontational and unfair. Separately, I acknowledge the existence of arguably-non-prosaic and mostly theoretical alignment approaches like davidad's Open Agency Architecture, CHAI's CIRL and utility uncertainty, Steve Byrnes's work on brain-like AGI safety, etc., that don't necessarily appear to fit this mold. I have varying opinions on the usefulness and viability of such approaches.)

Showing 3 of 14 replies (Click to show all)

8Rohin Shah4mo

Re: (1), if you look through the thread for the comment of mine that was linked above, I respond to top-down heuristical-constraint-based search as well. I agree the response is different and not just "computational inefficiency". Re: (2), I agree that near-future systems will be easily retargetable by just changing the prompt or the evaluator function (this isn't new to the o-series, you can also "retarget" any LLM chatbot by giving it a different prompt). If this continues to superintelligence, I would summarize it as "it turns out alignment wasn't a problem" (e.g. scheming never arose, we never had problems with LLMs exploiting systematic mistakes in our supervision, etc). I'd summarize this as "x-risky misalignment just doesn't happen by default", which I agree is plausible (see e.g. here), but when I'm talking about the viability of alignment plans like "retarget the search" I generally am assuming that there is some problem to solve. (Also, random nitpick, who is talking about inference runs of billions of dollars???)

4Thane Ruthenis4mo

1. Yup, I read through it after writing the previous response and now see that you don't need to be convinced of that point. Sorry about dragging you into this. 2. I could nitpick the details here, but I think the discussion has kind of wandered away from any pivotal points of disagreement, plus John didn't want object-level arguments under this post. So I petition to leave it at that. There's a log-scaling curve, OpenAI have already spent on the order of a million dollars just to score well on some benchmarks, and people are talking about "how much would you be willing to pay for the proof of the Riemann Hypothesis?". It seems like a straightforward conclusion that if o-series/inference-time scaling works as well as ML researchers seem to hope, there'd be billion-dollar inference runs funded by some major institutions.

Rohin Shah4mo62

OpenAI have already spent on the order of a million dollars just to score well on some benchmarks

Note this is many different inference runs each of which was thousands of dollars. I agree that people will spend billions of dollars on inference in total (which isn't specific to the o-series of models). My incredulity was at the idea of spending billions of dollars on a single episode, which is what I thought you were talking about given that you were talking about capability gains from scaling up inference-time compute.

See in context

295 The Field of AI Alignment: A Postmortem, and What To Do About It

by johnswentworth

26th Dec 2024

9 min read

160

295

A policeman sees a drunk man searching for something under a streetlight and asks what the drunk has lost. He says he lost his keys and they both look under the streetlight together. After a few minutes the policeman asks if he is sure he lost them here, and the drunk replies, no, and that he lost them in the park. The policeman asks why he is searching here, and the drunk replies, "this is where the light is".

Over the past few years, a major source of my relative optimism on AI has been the hope that the field of alignment would transition from pre-paradigmatic to paradigmatic, and make much more rapid progress.

At this point, that hope is basically dead. There has been some degree of paradigm formation, but the memetic competition has mostly been won by streetlighting: the large majority of AI Safety researchers and activists are focused on searching for their metaphorical keys under the streetlight. The memetically-successful strategy in the field is to tackle problems which are easy, rather than problems which are plausible bottlenecks to humanity’s survival. That pattern of memetic fitness looks likely to continue to dominate the field going forward.

This post is on my best models of how we got here, and what to do next.

What This Post Is And Isn't, And An Apology

This post starts from the observation that streetlighting has mostly won the memetic competition for alignment as a research field, and we'll mostly take that claim as given. Lots of people will disagree with that claim, and convincing them is not a goal of this post. In particular, probably the large majority of people in the field have some story about how their work is not searching under the metaphorical streetlight, or some reason why searching under the streetlight is in fact the right thing for them to do, or [...].

The kind and prosocial version of this post would first walk through every single one of those stories and argue against them at the object level, to establish that alignment researchers are in fact mostly streetlighting (and review how and why streetlighting is bad). Unfortunately that post would be hundreds of pages long, and nobody is ever going to get around to writing it. So instead, I'll link to:

Eliezer's List O' Doom
My own Why Not Just... sequence
Nate's How Various Plans Miss The Hard Bits Of The Alignment Challenge

(Also I might link some more in the comments section.) Please go have the object-level arguments there rather than rehashing everything here.

Next comes the really brutally unkind part: the subject of this post necessarily involves modeling what's going on in researchers' heads, such that they end up streetlighting. That means I'm going to have to speculate about how lots of researchers are being stupid internally, when those researchers themselves would probably say that they are not being stupid at all and I'm being totally unfair. And then when they try to defend themselves in the comments below, I'm going to say "please go have the object-level argument on the posts linked above, rather than rehashing hundreds of different arguments here". To all those researchers: yup, from your perspective I am in fact being very unfair, and I'm sorry. You are not the intended audience of this post, I am basically treating you like a child and saying "quiet please, the grownups are talking", but the grownups in question are talking about you and in fact I'm trash talking your research pretty badly, and that is not fair to you at all.

But it is important, and this post just isn't going to get done any other way. Again, I'm sorry.

Why The Streetlighting?

A Selection Model

First and largest piece of the puzzle: selection effects favor people doing easy things, regardless of whether the easy things are in fact the right things to focus on. (Note that, under this model, it's totally possible that the easy things are the right things to focus on!)

What does that look like in practice? Imagine two new alignment researchers, Alice and Bob, fresh out of a CS program at a mid-tier university. Both go into MATS or AI Safety Camp or get a short grant or [...]. Alice is excited about the eliciting latent knowledge (ELK) doc, and spends a few months working on it. Bob is excited about debate, and spends a few months working on it. At the end of those few months, Alice has a much better understanding of how and why ELK is hard, has correctly realized that she has no traction on it at all, and pivots to working on technical governance. Bob, meanwhile, has some toy but tangible outputs, and feels like he's making progress.

... of course (I would say) Bob has not made any progress toward solving any probable bottleneck problem of AI alignment, but he has tangible outputs and is making progress on something, so he'll probably keep going.

And that's what the selection pressure model looks like in practice. Alice is working on something hard, correctly realizes that she has no traction, and stops. (Or maybe she just keeps spinning her wheels until she burns out, or funders correctly see that she has no outputs and stop funding her.) Bob is working on something easy, he has tangible outputs and feels like he's making progress, so he keeps going and funders keep funding him. How much impact Bob's work has impact on humanity's survival is very hard to measure, but the fact that he's making progress on something is easy to measure, and the selection pressure rewards that easy metric.

Generalize this story across a whole field, and we end up with most of the field focused on things which are easy, regardless of whether those things are valuable.

Selection and the Labs

Here's a special case of the selection model which I think is worth highlighting.

Let's start with a hypothetical CEO of a hypothetical AI lab, who (for no particular reason) we'll call Sam. Sam wants to win the race to AGI, but also needs an AI Safety Strategy. Maybe he needs the safety strategy as a political fig leaf, or maybe he's honestly concerned but not very good at not-rationalizing. Either way, he meets with two prominent AI safety thinkers - let's call them (again for no particular reason) Eliezer and Paul. Both are clearly pretty smart, but they have very different models of AI and its risks. It turns out that Eliezer's model predicts that alignment is very difficult and totally incompatible with racing to AGI. Paul's model... if you squint just right, you could maybe argue that racing toward AGI is sometimes a good thing under Paul's model? Lo and behold, Sam endorses Paul's model as the Official Company AI Safety Model of his AI lab, and continues racing toward AGI. (Actually the version which eventually percolates through Sam's lab is not even Paul's actual model, it's a quite different version which just-so-happens to be even friendlier to racing toward AGI.)

A "Flinching Away" Model

While selection for researchers working on easy problems is one big central piece, I don't think it fully explains how the field ends up focused on easy things in practice. Even looking at individual newcomers to the field, there's usually a tendency to gravitate toward easy things and away from hard things. What does that look like?

Carol follows a similar path to Alice: she's interested in the Eliciting Latent Knowledge problem, and starts to dig into it, but hasn't really understood it much yet. At some point, she notices a deep difficulty introduced by sensor tampering - in extreme cases it makes problems undetectable, which breaks the iterative problem-solving loop, breaks ease of validation, destroys potential training signals, etc. And then she briefly wonders if the problem could somehow be tackled without relying on accurate feedback from the sensors at all. At that point, I would say that Carol is thinking about the real core ELK problem for the first time.

... and Carol's thoughts run into a blank wall. In the first few seconds, she sees no toeholds, not even a starting point. And so she reflexively flinches away from that problem, and turns back to some easier problems. At that point, I would say that Carol is streetlighting.

It's the reflexive flinch which, on this model, comes first. After that will come rationalizations. Some common variants:

Carol explicitly introduces some assumption simplifying the problem, and claims that without the assumption the problem is impossible. (Ray's workshop on one-shotting Baba Is You levels apparently reproduced this phenomenon very reliably.)
Carol explicitly says that she's not trying to solve the full problem, but hopefully the easier version will make useful marginal progress.
Carol explicitly says that her work on easier problems is only intended to help with near-term AI, and hopefully those AIs will be able to solve the harder problems.
(Most common) Carol just doesn't think about the fact that the easier problems don't really get us any closer to aligning superintelligence. Her social circles act like her work is useful somehow, and that's all the encouragement she needs.

... but crucially, the details of the rationalizations aren't that relevant to this post. Someone who's flinching away from a hard problem will always be able to find some rationalization. Argue them out of one (which is itself difficult), and they'll promptly find another. If we want people to not streetlight, then we need to somehow solve the flinching.

Which brings us to the "what to do about it" part of the post.

What To Do About It

Let's say we were starting a new field of alignment from scratch. How could we avoid the streetlighting problem, assuming the models above capture the core gears?

First key thing to notice: in our opening example with Alice and Bob, Alice correctly realized that she had no traction on the problem. If the field is to be useful, then somewhere along the way someone needs to actually have traction on the hard problems.

Second key thing to notice: if someone actually has traction on the hard problems, then the "flinching away" failure mode is probably circumvented.

So one obvious thing to focus on is getting traction on the problems.

... and in my experience, there are people who can get traction on the core hard problems. Most notably physicists - when they grok the hard parts, they tend to immediately see footholds, rather than a blank impassable wall. I'm picturing here e.g. the sort of crowd at the ILIAD conference; these were people who mostly did not seem at risk of flinching away, because they saw routes to tackle the problems. (Though to be clear, though ILIAD was a theory conference, I do not mean to imply that it's only theorists who ever have any traction.) And they weren't being selected away, because many of them were in fact doing work and making progress.

Ok, so if there are a decent number of people who can get traction, why do the large majority of the people I talk to seem to be flinching away from the hard parts?

How We Got Here

The main problem, according to me, is the EA recruiting pipeline.

On my understanding, EA student clubs at colleges/universities have been the main “top of funnel” for pulling people into alignment work during the past few years. The mix people going into those clubs is disproportionately STEM-focused undergrads, and looks pretty typical for STEM-focused undergrads. We’re talking about pretty standard STEM majors from pretty standard schools, neither the very high end nor the very low end of the skill spectrum.

... and that's just not a high enough skill level for people to look at the core hard problems of alignment and see footholds.

Who To Recruit Instead

We do not need pretty standard STEM-focused undergrads from pretty standard schools. In practice, the level of smarts and technical knowledge needed to gain any traction on the core hard problems seems to be roughly "physics postdoc". Obviously that doesn't mean we exclusively want physics postdocs - I personally have only an undergrad degree, though amusingly a list of stuff I studied has been called "uncannily similar to a recommendation to readers to roll up their own doctorate program". Point is, it's the rough level of smarts and skills which matters, not the sheepskin. (And no, a doctorate degree in almost any other technical field, including ML these days, does not convey a comparable level of general technical skill to a physics PhD.)

As an alternative to recruiting people who have the skills already, one could instead try to train people. I've tried that to some extent, and at this point I think there just isn't a substitute for years of technical study. People need that background knowledge in order to see footholds on the core hard problems.

Integration vs Separation

Last big piece: if one were to recruit a bunch of physicists to work on alignment, I think it would be useful for them to form a community mostly-separate from the current field. They need a memetic environment which will amplify progress on core hard problems, rather than... well, all the stuff that's currently amplified.

This is a problem which might solve itself, if a bunch of physicists move into alignment work. Heck, we've already seen it to a very limited extent with the ILIAD conference itself. Turns out people working on the core problems want to talk to other people working on the core problems. But the process could perhaps be accelerated a lot with more dedicated venues.

AI Alignment FieldbuildingAI

Frontpage

295

Mentioned in

117The Plan - 2024 Update

97Reasons for and against working on technical AI safety at a frontier AI lab

45AI #97: 4

The Field of AI Alignment: A Postmortem, and What To Do About It

9Alexander Gietelink Oldenziel

50Alexander Gietelink Oldenziel

17the gears to ascension

New Comment

160 comments, sorted by

top scoring

Click to highlight new comments since: Today at 10:16 AM

Some comments are truncated due to high volume. (⌘F to expand all)Change truncation settings

[-]aysja4mo16183

I’m not convinced that the “hard parts” of alignment are difficult in the standardly difficult, g-requiring way that e.g., a physics post-doc might possess. I do think it takes an unusual skillset, though, which is where most of the trouble lives. I.e., I think the pre-paradigmatic skillset requires unusually strong epistemics (because you often need to track for yourself what makes sense), ~creativity (the ability to synthesize new concepts, to generate genuinely novel hypotheses/ideas), good ability to traverse levels of abstraction (connecting details to large level structure, this is especially important for the alignment problem), not being efficient market pilled (you have to believe that more is possible in order to aim for it), noticing confusion, and probably a lot more that I’m failing to name here.

Most importantly, though, I think it requires quite a lot of willingness to remain confused. Many scientists who accomplished great things (Darwin, Einstein) didn’t have publishable results on their main inquiry for years. Einstein, for instance, talks about wandering off for weeks in a state of “psychic tension” in his youth, it took ~ten years to go from his first inkling of ... (read more)

[-]Raemon4mo182

This is the sort of thing I find appealing to believe, but I feel at least somewhat skeptical of. I notice a strong emotional pull to want this to be true (as well as an interesting counterbalancing emotional pull for it to not be true).

I don't think I've seen output from the people aspiring in this direction without being visibly quite smart to make me think "okay yeah it seems like it's on track in some sense."

I'd be interested in hearing more explicit cruxes from you about it.

I do think it's plausible than the "smart enough, creative enough, strong epistemics, independent, willing to spend years without legible output, exceptionally driven, and so on" are sufficient (if you're at least moderately-but-not-exceptionally-smart). Those are rare enough qualities that it doesn't necessarily feel like I'm getting a free lunch, if they turn out to be sufficient for groundbreaking pre-paradigmatic research. I agree the x-risk pipeline hasn't tried very hard to filter for and/or generate people with these qualities.

(well, okay, "smart enough" is doing a lot of work there, I assume from context you mean "pretty smart but not like genius smart")

But, I've only really seen you note... (read more)

[-]johnswentworth4mo144

I’m not convinced that the “hard parts” of alignment are difficult in the standardly difficult, g-requiring way that e.g., a physics post-doc might possess.

To be clear, I wasn't talking about physics postdocs mainly because of raw g. Raw g is a necessary element, and physics postdocs are pretty heavily loaded on it, but I was talking about physics postdocs mostly because of the large volume of applied math tools they have.

The usual way that someone sees footholds on the hard parts of alignment is to have a broad enough technical background that they can see some analogy to something they know about, and try borrowing tools that work on that other thing. Thus the importance of a large volume of technical knowledge.

7JuliaHP4mo

Curious about what it would look like to pick up the relevant skills, especially the subtle/vague/tacit skills, in an independent-study setting rather than in academia. As well as the value of doing this, IE maybe its just a stupid idea and its better to just go do a PhD. Is the purpose of a PhD to learn the relevant skills, or to filter for them? (If you have already written stuff which suffices as a response, id be happy to be pointed to the relevant bits rather than having them restated) "Broad technical knowledge" should be in some sense the "easiest" (not in terms of time-investment, but in terms of predictable outcomes), by reading lots of textbooks (using similar material as your study guide). Writing/communication, while more vague, should also be learnable by just writing a lot of things, publishing them on the internet for feedback, reflecting on your process etc. Something like "solving novel problems" seems like a much "harder" one. I don't know if this is a skill with a simple "core" or a grab-bag of tactics. Textbook problems take on a "meant-to-be-solved" flavor and I find one can be very good at solving these without being good at tackling novel problems. Another thing I notice is that when some people (myself included) try solving novel problems, we can end up on a path which gets there eventually, but if given "correct" feedback integration would go OOM faster. I'm sure there are other vague-skills which one ends up picking up from a physics PhD. Can you name others, and how one picks them up intentionally? Am I asking the wrong question?

[-]johnswentworth4mo142

I currently think broad technical knowledge is the main requisite, and I think self-study can suffice for the large majority of that in principle. The main failure mode I see would-be autodidacts run into is motivation, but if you can stay motivated then there's plenty of study materials.

For practice solving novel problems, just picking some interesting problems (preferably not AI) and working on them for a while is a fine way to practice.

9Johannes C. Mayer4mo

Why not AI? Is it that AI alignment is too hard? Or do you think it's likely one would fall into the "try a bunch of random stuff" paradigm popular in AI, which wouldn't help much in getting better at solving hard problems? What do you think about the strategy of instead of learning a textbook e.g. on information theory, or compilers you try to write the textbook and only look at existing material if you are really stuck. That's my primary learning strategy. It's very slow and I probably do it too much, but it allows me to train to solve hard problems that aren't super hard. If you read all the text books all the practice problems remaining are very hard.

7JuliaHP4mo

(That broad technical knowledge is the main thing (as opposed to tacit skills) why you value a physics PhD is a really surprising response to me, and seems like an important part of the model that didn't come across from the post.)

[-]Seth Herd4mo145

I think this is right. A couple of follow-on points:

There's a funding problem if this is an important route to progress. If good work is illegible for years, it's hard to decide who to fund, and hard to argue for people to fund it. I don't have a proposed solution, but I wanted to note this large problem.

Einstein did his pre-paradigmatic work largely alone. Better collaboration might've sped it up.

LessWrong allows people to share their thoughts prior to having publishable journal articles and get at least a few people to engage.

This makes the difficult pre-paradigmatic thinking a group effort instead of a solo effort. This could speed up progress dramatically.

This post and the resulting comments and discussions is an example of the community collectively doing much of the work you describe: traversing levels, practicing good epistemics, and remaining confused.

Having conversations with other LWers (on calls, by DM, or in extended comment threads) is tremendously useful for me. I could produce those same thoughts and critiques, but it would take me longer to arrive at all of those different viewpoints of the issue. I mention this to encourage others to do it. Communication takes time... (read more)

5Linda Linsefors4mo

I think this is false. As I remember hearing the story, he where corresponding with several people via letters.

9Steven Byrnes4mo

I know very little, but there’s a fun fact here: “During their lifetimes, Darwin sent at least 7,591 letters and received 6,530; Einstein sent more than 14,500 and received more than 16,200.” (Not sure what fraction was technical vs personal.) Also, this is a brief summary of Einstein’s mathematician friend Marcel Grossmann’s role in general relativity.

7Seth Herd4mo

In the piece you linked, it sounds like Einstein had the correct geometry for general relativity one day after he asked for help finding one. Of course, that's one notable success amongst perhaps a lot of collaboration. The number of letters he sent and received implies that he actually did a lot of written collaboration. I wonder about the value of real-time conversation vs. written exchanges. And the value of being fully engaged; truly curious about your interlocutor's ideas. My own experience watching progress happen (and not-happen) in theoretical neuroscience is that fully engaged conversations with other true experts with different viewpoints was rare and often critical for real progress. My perception is that those conversations are tricky to produce. Experts are often splitting their attention between impressing people and coolheaded, openminded discussion. And they weren't really seeking out these conversations, just having them when it was convenient, and being really fully engaged only when the interpersonal vibe happened to be right. Even so, the bit of real conversation I saw seemed quite important. It would be helpful to understand collaboration on difficult theory better, but it would be a whole research topic.

4Seth Herd4mo

By largely alone I meant without the rich collaboration of having an office in the same campus or phone calls or LessWrong.

5Linda Linsefors4mo

I think the qualitive difference is not as large as you think it is. But I also don't think this is very crux-y for anything, so I will not try to figure out how to translate my reasoning to words, sorry.

4Chris_Leong4mo

Agreed. Simply focusing on physics post-docs feels too narrow to me. Then again, just as John has a particular idea of what good alignment research looks like, I have my own idea: I would lean towards recruiting folk with both a technical and a philosophical background. It's possible that my own idea is just as narrow.

8johnswentworth4mo

The post did explicitly say "Obviously that doesn't mean we exclusively want physics postdocs".

4Chris_Leong4mo

Thanks for clarifying. Still feels narrow as a primary focus.

[-]Vanessa Kosoy4mo*12232

Good post, although I have some misgivings about how unpleasant it must be to read for some people.

One factor not mentioned here is the history of MIRI. MIRI was a pioneer in the field, and it was MIRI who articulated and promoted the agent foundations research agenda. The broad goals of agent foundations^[1] are (IMO) load-bearing for any serious approach to AI alignment. But, when MIRI essentially declared defeat, in the minds of many that meant that any approach in that vein is doomed. Moreover, MIRI's extreme pessimism deflates motivation and naturally produces the thought "if they are right then we're doomed anyway, so might as well assume they are wrong".

Now, I have a lot of respect for Yudkowsky and many of the people who worked at MIRI. Yudkowsky started it all, and MIRI made solid contributions to the field. I'm also indebted to MIRI for supporting me in the past. However, MIRI also suffered from some degree of echo-chamberism, founder-effect-bias, insufficient engagement with prior research (due to hubris), looking for nails instead of looking for hammers, and poor organization^[2].

MIRI made important progress in agent foundations, but also missed an opportunity to do ... (read more)

9Johannes C. Mayer4mo

What are some concrete examples of the of research that MIRI insufficiently engaged with? Are there general categories of prior research that you think are most underutilized by alignment researchers?

[-]Vanessa Kosoy4mo162

Learning theory, complexity theory and control theory. See the "AI theory" section of the LTA reading list.

4joshcohen4mo

MIRI had a lot of "not invented here" mindset. It was pointed out, e.g. [here](https://web.archive.org/web/20170918044233/http://files.openphilanthropy.org/files/Grants/MIRI/consolidated_public_reviews.pdf) but unfortunately the mindset is self-reinforcing so there wasn't much to be done.

1Satron3mo

Your link formatting got messed up here.

[-]leogao4mo10680

I'm sympathetic to most prosaic alignment work being basically streetlighting. However, I think there's a nirvana fallacy going on when you claim that the entire field has gone astray. It's easiest to illustrate what I mean with an analogy to capabilities.

In capabilities land, there were a bunch of old school NLP/CV people who insisted that there's some kind of true essence of language or whatever that these newfangled neural network things weren't tackling. The neural networks are just learning syntax, but not semantics, or they're ungrounded, or they don't have a world model, or they're not representing some linguistic thing, so therefore we haven't actually made any progress on true intelligence or understanding etc etc. Clearly NNs are just progress on the surface appearance of intelligence while actually just being shallow pattern matching, so any work on scaling NNs is actually not progress on intelligence at all. I think this position has become more untenable over time. A lot of people held onto this view deep into the GPT era but now even the skeptics have to begrudgingly admit that NNs are pretty big progress even if additional Special Sauce is needed, and that the other ... (read more)

[-]johnswentworth4mo170

I think you have two main points here, which require two separate responses. I'll do them opposite the order you presented them.

Your second point, paraphrased: 90% of anything is crap, that doesn't mean there's no progress. I'm totally on board with that. But in alignment today, it's not just that 90% of the work is crap, it's that the most memetically successful work is crap. It's not the raw volume of crap that's the issue so much as the memetic selection pressures.

Your first point, paraphrased: progress toward the the hard problem does not necessarily immediately look like tackling the meat of the hard problem directly. I buy that to some extent, but there are plenty of cases where we can look at what people are doing and see pretty clearly that it is not progress toward the hard problem, whether direct or otherwise. And indeed, I would claim that prosaic alignment as a category is a case where people are not making progress on the hard problems, whether direct or otherwise. In particular, one relevant criterion to look at here is generalizability: is the work being done sufficiently general/robust that it will still be relevant once the rest of the problem is solved (and multiple things change in not-yet-predictable ways in order to solve the rest of the problem)? See e.g. this recent comment for an object-level example of what I mean.

[-]leogao4mo312

in capabilities, the most memetically successful things were for a long time not the things that actually worked. for a long time, people would turn their noses at the idea of simply scaling up models because it wasn't novel. the papers which are in retrospect the most important did not get that much attention at the time (e.g gpt2 was very unpopular among many academics; the Kaplan scaling laws paper was almost completely unnoticed when it came out; even the gpt3 paper went under the radar when it first came out.)

one example of a thing within prosaic alignment that i feel has the possibility of generalizability is interpretability. again, if we take the generalizability criteria and map it onto the capabilities analogy, it would be something like scalability - is this a first step towards something that can actually do truly general reasoning, or is it just a hack that will no longer be relevant once we discover the truly general algorithm that subsumes the hacks? if it is on the path, can we actually shovel enough compute into it (or its successor algorithms) to get to agi in practice, or do we just need way more compute than is practical? and i think at the time of gpt2 these were completely unsettled research questions! it was actually genuinely unclear whether writing articles about ovid's unicorn is a genuine first step towards agi, or just some random amusement that will fade into irrelevancy. i think interp is in a similar position where it could work out really well and eventually become the thing that works, or it could just be a dead end.

[-]johnswentworth4mo10-2

If you're thinking mainly about interp, then I basically agree with what you've been saying. I don't usually think of interp as part of "prosaic alignment", it's quite different in terms of culture and mindset and it's much closer to what I imagine a non-streetlight-y field of alignment would look like. 90% of it is crap (usually in streetlight-y ways), but the memetic selection pressures don't seem too bad.

If we had about 10x more time than it looks like we have, then I'd say the field of interp is plausibly on track to handle the core problems of alignment.

[-]leogao4mo2217

ok good that we agree interp might plausibly be on track. I don't really care to argue about whether it should count as prosaic alignment or not. I'd further claim that the following (not exhaustive) are also plausibly good (I'll sketch each out for the avoidance of doubt because sometimes people use these words subtly differently):

model organisms - trying to probe the minimal sets of assumptions to get various hypothesized spicy alignment failures seems good. what is the least spoonfed demonstration of deceptive alignment we can get that is analogous mechanistically to the real deal? to what extent can we observe early signs of the prerequisites in current models? which parts of the deceptive alignment arguments are most load bearing?
science of generalization - in practice, why do NNs sometimes generalize and sometimes not? why do some models generalize better than others? In what ways are humans better or worse than NNs at generalizing? can we understand this more deeply without needing mechanistic understanding? (all closely related to ELK)
goodhart robustness - can you make reward models which are calibrated even under adversarial attack, so that when you optimize them reall

... (read more)

8johnswentworth4mo

All four of those I think are basically useless in practice for purposes of progress toward aligning significantly-smarter-than-human AGI, including indirectly (e.g. via outsourcing alignment research to AI). There are perhaps some versions of all four which could be useful, but those versions do not resemble any work I've ever heard of anyone actually doing in any of those categories. That said, many of those do plausibly produce value as propaganda for the political cause of AI safety, especially insofar as they involve demoing scary behaviors. EDIT-TO-ADD: Actually, I guess I do think the singular learning theorists are headed in a useful direction, and that does fall under your "science of generalization" category. Though most of the potential value of that thread is still in interp, not so much black-box calculation of RLCTs.

9Alexander Gietelink Oldenziel4mo

I think we would all be interested to hear you elaborate on why you think these approaches have approximately no value. Perhaps this will be in a follow-up post.

1Bronson Schoen4mo

It’s difficult for me to understand how this could be “basically useless in practice” for: It seems to me you’d want to understand and strongly show how and why different approaches here fail, and in any world where you have something like “outsourcing alignment research” you want some form of oversight.

5Leon Lang4mo

Thanks for the list! I have two questions: 1: Can you explain how generalization of NNs relates to ELK? I can see that it can help with ELK (if you know a reporter generalizes, you can train it on labeled situations and apply it more broadly) or make ELK unnecessary (if weak to strong generalization perfectly works and we never need to understand complex scenarios). But I’m not sure if that’s what you mean. 2: How is goodhart robustness relevant? Most models today don’t seem to use reward functions in deployment, and in training the researchers can control how hard they optimize these functions, so I don’t understand why they necessarily need to be robust under strong optimization.

5yams3mo

There are plenty of cases where John can glance at what people are doing and see pretty clearly that it is not progress toward the hard problem. Importantly, people with the agent foundations class of anxieties (which I embrace; I think John is worried about the right things!) do not spend time engaging on a gears level with prominent prosaic paradigms and connecting the high level objection ("it ignores the hard part of the problem") with the details of the research. "But Tsvi and John actually spend a lot of time doing this." No, they don't! They paraphrase the core concern over and over again, often seemingly without reading the paper. I don't think reading the paper would change your minds (nor should it!), but I think that there's a culture problem tied to this off-hand dismissal of prosaic work that disincentivizes potential agent foundations (or similar new thing that shares the core concerns of agent foundations) researchers from engaging with, i.e., John. Prosaic work is fraught and, much of it, doomed. New researchers over-index on tractability because short feedback loops are comforting ('street-lighting'). Why aren't we explaining why that is, on the terms of the research itself, rather than expecting people to be persuaded by the same high level point getting hammered into them again and again? I've watched this work in real-time. If you listen to someone talk about their work, or read their paper and follow up in person, they are often receptive to a conversation about worlds in which their work is ineffective, evidence that we're likely to be in such a world, and even to shifting the direction of their work in recognition of that evidence. Instead, people with their eye on the ball are doing this tribalistic(-seeming) thing. Yup, the deck is stacked against humanity solving the hard problems; for some reason, folks who know that are also committed to playing their hands poorly, and then blaming (only) the stacked deck! John's recent post on co

5johnswentworth3mo

Big crux here: I don't actually expect useful research to occur as a result of my control-critique post. Even having updated on the discussion remaining more civil than I expected, I still expect basically-zero people to do anything useful as a result. As a comparison: I wrote a couple posts on my AI model delta with Yudkowsky and with Christiano. For each of them, I can imagine changing ~one big piece in my model, and end up with a model which looks basically like theirs. By contrast, when I read the stuff written on the control agenda... it feels like there is no model there at all. (Directionally-correct but probably not quite accurate description:) it feels like whoever's writing, or whoever would buy the control agenda, is just kinda pattern-matching natural language strings without tracking the underlying concepts those strings are supposed to represent. (Joe's recent post on "fake vs real thinking" feels like it's pointing at the right thing here; the posts on control feel strongly like "fake" thinking.) And that's not a problem which gets fixed by engaging at the object level; that type of cognition will mostly not produce useful work, so getting useful work out of such people would require getting them to think in entirely different ways. ... so mostly I've tried to argue at a different level, like e.g. in the Why Not Just... posts. The goal there isn't really to engage the sort of people who would otherwise buy the control agenda, but rather to communicate the underlying problems to the sort of people who would already instinctively feel something is off about the control agenda, and give them more useful frames to work with. Because those are the people who might have any hope of doing something useful, without the whole structure of their cognition needing to change first.

7yams3mo

I think the reason nobody will do anything useful-to-John as a result of the control critique post is that control is explicitly not aiming at the hard parts of the problem, and knows this about itself. In that way, control is an especially poorly selected target if the goal is getting people to do anything useful-to-John. I'd be interested in a similar post on the Alignment Faking paper (or model organisms more broadly), on RAT, on debate, on faithful CoT, on specific interpretability paradigms (circuits v SAEs, vs some coherentist approach vs shards vs....), and would expect those to have higher odds of someone doing something useful-to-John. But useful-to-John isn't really the metric I think the field should be using, either.... I'm kind of picking on you here because you are least guilty of this failing relative to researchers in your reference class. You are actually saying anything at all, sometimes with detail, about how you feel about particular things. However, you wouldn't be my first-pick judge for what's useful; I'd rather live in a world where like half a dozen people in your reference class are spending non-zero time arguing about the details of the above agendas and how they interface with your broader models, so that the researchers working on those things can update based on those critiques (there may even be ways for people to apply the vector implied by y'all's collective input, and generate something new / abandon their doomed plans).

4eggsyntax4mo

It's a bit tangential to the context, but this is a topic I have an ongoing interest in: what leads you to believe that the skeptics (in particular NLP people in the linguistics community) have shifted away from their previous positions? My impression has been that many of them (though not all) have failed to really update to any significant degree. Eg here's a paper from just last month which argues that we must not mistake the mere engineering that is LLM behavior for language understanding or production.

[-]Jan_Kulveit4mo*9373

My guess is a roughly equally "central" problem is the incentive landscape around the OpenPhil/Anthropic school of thought

where you see Sam, I suspect something like "the lab memeplexes". Lab superagents have instrumental convergent goals, and the instrumental convergent goals lead to instrumental, convergent beliefs, and also to instrumental blindspots
there are strong incentives for individual people to adjust their beliefs: money, social status, sense of importance via being close to the Ring
there are also incentives for people setting some of the incentives: funding something making progress on something seems more successful and easier than funding the dreaded theory

[-][anonymous]4mo7013

Humans are really, really bad at doing long chains of abstract reasoning without regular contact with reality, so in practice imo good philosophy has to have feedback loops with reality, otherwise you will get confused.

He was talking about philosophy in particular at that juncture, in response to Wei Dai's concerns over metaphilosophical competence, but this po... (read more)

[-]Noosphere894mo*155

I actually disagree with the natural abstractions research being ungrounded. Indeed, I think there is reason to believe that at least some of the natural abstractions work, especially the natural abstraction hypothesis actually sorts of holds true for today's AI, and thus is the most likely out of the theoretical/agent-foundation approaches to work (I'm usually critical to agent foundations, but John Wentworth's work is an exception that I'd like funding for).

For example, this post does an experiment that shows that OOD data still makes the Platonic Representation Hypothesis true, meaning that it's likely that deeper factors are at play than just shallow similarity:

https://www.lesswrong.com/posts/Su2pg7iwBM55yjQdt/exploring-the-platonic-representation-hypothesis-beyond-in

[-][anonymous]4mo160

I'm wary of a possible equivocation about what the "natural abstraction hypothesis" means here.

If we are referring to the redundant information hypothesis and various kinds of selection theorems, this is a mathematical framework that could end up being correct, is not at all ungrounded, and Wentworth sure seems like the man for the job.

But then you are still left with the task of grounding this framework in physical reality to allow you to make correct empirical predictions about and real-world interventions on what you will see from more advanced models. Our physical world abstracting well seems plausible (not necessarily >50% likely), and these abstractions being "natural" (e.g., in a category-theoretic sense) seems likely conditional on the first clause of this sentence being true, but I give an extremely low probability to the idea that these abstractions will be used by any given general intelligence or (more to the point) advanced AI model to a large and wide enough extent that retargeting the search is even close to possible.

And indeed, it is the latter question that represents the make-or-break moment for natural abstractions' theory of change, for it is only when ... (read more)

9Thane Ruthenis4mo

I think this statement is quite ironic in retrospect, given how OpenAI's o-series seems to work (at train-time and at inference-time both), and how much AI researchers hype it up. By contrast, my understanding is that the sort of search John is talking about retargeting isn't the brute-force babble-and-prune algorithms, but a top-down heuristical-constraint-based search. So it is in fact the ML researchers now who believe in the superiority of the computationally inefficient search; not the agency theorists.

[-]Noosphere894mo*134

Re the OpenAI o-series and search, my initial prediction is that Q*/MCTS search will work well on problems that are easy to verify and and easy to get training data for, and not work if either of these 2 conditions are violated, and secondarily will be reliant on the model having good error correction capabilities to use the search effectively, which is why I expect we can make RL capable of superhuman performance on mathematics/programming with some rather moderate schlep/drudge work, and I also expect cost reductions such that it can actually be practical, but I'm only giving a 50/50 chance by 2028 for superhuman performance as measured by benchmarks in these domains.

I think my main difference from you, Thane Ruthenis is I expect costs to reduce surprisingly rapidly, though this is admittedly untested.

This will accelerate AI progress, but not immediately cause an AI explosion, though in the more extreme paces this could create something like a scenario where programming companies are founded by a few people smartly managing a lot of programming AIs, and programming/mathematics experiencing something like what happened to the news industry from the rise of the internet, where ther... (read more)

9Thane Ruthenis4mo

I'm not strongly committed to the view that the costs won't rapidly reduce: I can certainly see the worlds in which it's possible to efficiently distill trees-of-thought unrolls into single chains of thoughts. Perhaps it scales iteratively, where we train a ML model to handle the next layer of complexity by generating big ToTs, distilling them into CoTs, then generating the next layer of ToTs using these more-competent CoTs, etc. Or perhaps distillation doesn't work that well, and the training/inference costs grow exponentially (combinatorially?).

2Noosphere894mo

Yeah, we will have to wait at least several years. One confound in all of this is that big talent is moving out of OpenAI, which means I'm more bearish on the company's future prospects specifically without it being that much of a detriment towards progress towards AGI.

[-]Rohin Shah4mo112

I think this statement is quite ironic in retrospect, given how OpenAI's o-series seems to work

I stand by my statement and don't think anything about the o-series model invalidates it.

And to be clear, I've expected for many years that early powerful AIs will be expensive to run, and have critiqued people for analyses that implicitly assumed or implied that the first powerful AIs will be cheap, prior to the o-series being released. (Though unfortunately for the two posts I'm thinking of, I made the critiques privately.)

There's a world of difference between "you can get better results by thinking longer" (yeah, obviously this was going to happen) and "the AI system is a mesa optimizer in the strong sense that it has an explicitly represented goal such that you can retarget the search" (I seriously doubt it for the first transformative AIs, and am uncertain for post-singularity superintelligence).

4Thane Ruthenis4mo

To lay out my arguments properly: 1. "Search is ruinously computationally inefficient" does not work as a counter-argument against the retargetability of search, because the inefficiency argument applies to babble-and-prune search, not to the top-down heuristical-constraint-based search that was/is being discussed. There are valid arguments against easily-retargetable heuristics-based search as well (I do expect many learned ML algorithms to be much messier than that). But this isn't one of them. 2. ML researchers are currently incredibly excited about the inference-time scaling laws, talking about inference runs costing millions/billions of dollars, and how much capability will be unlocked this way. The o-series paradigm would use this compute to, essentially, perform babble-and-prune search. The pruning would seem to be done by some easily-swappable evaluator (either the system's own judgement based on the target specified in a prompt, or an external theorem-prover, etc.). If things will indeed go this way, then it would seem that a massive amount of capabilities will be based on highly inefficient babble-and-prune search, and that this search would be easily retargetable by intervening on one compact element of the system (the prompt, or the evaluator function).

8Rohin Shah4mo

4Thane Ruthenis4mo

6Rohin Shah4mo

2Noosphere894mo

Yeah, it hasn't been shown that these abstractions can ultimately be retargeted by default for today's AI.

6Jozdien4mo

(I haven't read your comments you link, so apologies if you've already responded to this point before). I can't speak to most of these simply out of lack of deep familiarity, but I don't think natural abstractions is disqualified at all by this. What do we actually want out of interpretability? I don't think mechanistic interpretability, as it stands currently, gives us explanations of the form we actually want. For example, what are a model's goals? Is it being deceptive? To get answers to those questions, you want to first know what those properties actually look like - you can't get away with identifying activations corresponding to how to deceive humans, because those could relate to a great number of things (e.g. modelling other deceptive agents). Composability is a very non-trivial problem. If you want to answer those questions, you need to find a way to get better measures of whatever property you want to understand. This is the central idea behind Eliciting Latent Knowledge and other work that aims for unsupervised honesty (where the property is honesty), what I call high-level interpretability of inner search / objectives, etc. Natural abstractions is more agnostic about what kinds of properties we would care about, and tries to identify universal building blocks for any high-level property like this. I am much more optimistic about picking a property and going with it, and I think this makes the problem easier, but that seems like a different disagreement than yours considering both are inevitably somewhat conceptual and require more prescriptive work than work focusing solely on frontier models. If you wanted to get good handles to steer your model at all, you're going to have to do something like figuring out the nature of the properties you care about. You can definitely make that probem easier by focusing on how those properties instantiate in specific classes of systems like LLMs or neural nets (and I do in my work), but you still have to deal

4mesaoptimizer4mo

Even if I'd agree with your conclusion, your argument seems quite incorrect to me. That's what math always is. The applicability of any math depends on how well the mathematical models reflect the situation involved. It seems very unlikely to me that you'd have many 'similar-looking mathematical models'. If a class of real-world situations seems to be abstracted in multiple ways such that you have hundreds (not even millions) of mathematical models that supposedly could capture its essence, maybe you are making a mistake somewhere in your modelling. Abstract away the variations. From my experience, you may have a small bunch of mathematical models that could likely capture the essence of the class of real-world situations, and you may debate with your friends about which one is more appropriate, but you will not have 'multiple similar-looking models'. Nevertheless, I agree with your general sentiment. I feel like humans will find it quite difficult make research progress without concrete feedback loops, and that actually trying stuff with existing examples of models (that is, the stuff that Anthropic and Apollo are doing, for example) provide valuable data points. I also recommend maybe not spending so much time reading LessWrong and instead reading STEM textbooks.

[-]testingthewaters4mo6726

Epistemic status: This is a work of satire. I mean it---it is a mean-spirited and unfair assessment of the situation. It is also how, some days, I sincerely feel.

A minivan is driving down a mountain road, headed towards a cliff's edge with no guardrails. The driver floors the accelerator.

Passenger 1: "Perhaps we should slow down somewhat."

Passengers 2, 3, 4: "Yeah, that seems sensible."

Driver: "No can do. We're about to be late to the wedding."

Passenger 2: "Since the driver won't slow down, I should work on building rocket boosters so that (when we inevitably go flying off the cliff edge) the van can fly us to the wedding instead."

Passenger 3: "That seems expensive."

Passenger 2: "No worries, I've hooked up some funding from Acceleration Capital. With a few hours of tinkering we should get it done."

Passenger 1: "Hey, doesn't Acceleration Capital just want vehicles to accelerate, without regard to safety?"

Passenger 2: "Sure, but we'll steer the funding such that the money goes to building safe and controllable rocket boosters."

The van doesn't slow down. The cliff looks closer now.

Passenger 3: [looking at what Passenger 2 is building] "Uh, haven't you just made a faster engine?"

Passen... (read more)

[-]romeostevensit4mo19-3

unfortunately, the disanalogy is that any driver who moves their foot towards the brakes is almost instantly replaced with one who won't.

3testingthewaters4mo

Even so, it seems obvious to me that addressing the mysterious issue of the accelerating drivers is the primary crux in this scenario.

4romeostevensit4mo

Yes, and I don't mean to overstate a case for helplessness. Demons love convincing people that the anti demon button doesn't work so that they never press it even though it is sitting right out in the open.

[-]faul_sname4mo180

Driver: My map doesn't show any cliffs

Passenger 1: Have you turned on the terrain map? Mine shows a sharp turn next to a steep drop coming up in about a mile

Passenger 5: Guys maybe we should look out the windshield instead of down at our maps?

Driver: No, passenger 1, see on your map that's an alternate route, the route we're on doesn't show any cliffs.

Passenger 1: You don't have it set to show terrain.

Passenger 6: I'm on the phone with the governor now, we're talking about what it would take to set a 5 mile per hour national speed limit.

Passenger 7: Don't you live in a different state?

Passenger 5: The road seems to be going up into the mountains, though all the curves I can see from here are gentle and smooth.

Driver and all passengers in unison: Shut up passenger 5, we're trying to figure out if we're going to fall off a cliff here, and if so what we should do about it.

Passenger 7: Anyway, I think what we really need to do to ensure our safety is to outlaw automobiles entirely.

Passenger 3: The highest point on Earth is 8849m above sea level, and the lowest point is 430 meters below sea level, so the cliff in front of us could be as high as 9279m.

[-]Charbel-Raphaël4mo5319

I think I do agree with some points in this post. This failure mode is the same as the one I mentioned about why people are doing interpretability for instance (section Outside view: The proportion of junior researchers doing Interp rather than other technical work is too high), and I do think that this generalizes somewhat to whole field of alignment. But I'm highly skeptical that recruiting a bunch of physicists to work on alignment would be that productive:

Empirically, we've already kind of tested this, and it doesn't work.
- I don't think that what Scott Aaronson produced while at OpenAI had really helped AI Safety: He is exactly doing what is criticized in the post: Streetlight research and using techniques that he was already familiar with from his previous field of research, I don't think the author of the OP would disagree with me. Maybe n=1, but it was one of the most promising shots.
- Two years ago, I was doing field-building and trying to source talent, primarily selecting based on pure intellect and raw IQ. I've organized the Von Neumann Symposium around the problem of corrigibility, I targeted IMO laureates, and individuals from the best school in France, ENS Ulm, which arg

... (read more)

6Noosphere894mo

The big issue is that a lot of the swiss cheese strategy assumes failures are sufficiently uncorrelated that multiple defense layers stack, but AI can coordinate failures such that unlikely events become probable ones even through layers of defenses. However, I think AI progress is slow and continuous enough that I do think that swiss cheese models are reasonably useful, and I do think there's a band of performance where optimization doesn't totally obsolete the strategy.

4Chris_Leong1mo

The problem is that the Swiss cheese model and legislative efforts primarily just buy us time. We still need to be making progress towards a solution and whilst it's good for some folk to bet on us duct-taping our way through, I think we also want some folk attempting to work on things that are more principled.

4Noosphere894mo

I'd probably add AI control to this list, as it's a method to use AIs of a specific capability range without AIs escaping even assuming misalignment of AIs. Unusually relative to most AI governance people, I think regulation is most helpful in cases where AI alignment succeeds by a combination of instruction following/personal intent alignment, but no CEV of humanity occurs, and CEV alignment only occurs some of the time (and even then, it's personal CEV alignment), which I think is the most plausible world right now.

6Charbel-Raphaël4mo

No, AI control doesn't pass the bar, because we're still probably fucked until we have a solution for open source AI or race for superintelligence, for example, and OpenAI doesn't seem to be planning to use control, so this looks to me like the research that's sort of being done in your garage but ignored by the labs (and that's sad, control is great I agree).

4ryan_greenblatt4mo

I think this somewhat understates the level of buy in from labs. I agree that "quickly building superintelligence" makes control look notably less appealing. (Though note that this also applies for any prosaic method that is unlikely to directly productively scale to superintelligence.) I'm not very worried about open source AI at the moment, but I am quite worried about inadequate security undermining control and other hopes.

6Charbel-Raphaël4mo

Maybe you have some information that I don't have about the labs and the buy-in? You think this applies to OpenAI and not just Anthropic? But as far as open source goes, I'm not sure. Deepseek? Meta? Mistral? xAI? Some big labs are just producing open source stuff. DeepSeek is maybe only 6 months behind. Is that enough headway? It seems to me that the tipping point for many people (I don't know for you) about open source is whether or not open source is better than close source, so this is a relative tipping point in terms of capabilities. But I think we should be thinking about absolute capabilities. For example, what about bioterrorism? At some point, it's going to be widely accessible. Maybe the community only cares about X-risks, but personally I don't want to die either. Is there a good explanation online of why I shouldn't be afraid of open-source?

6ryan_greenblatt4mo

As far as open source, the quick argument is that once AI becomes sufficiently powerful, it's unlikely that the incentives are toward open sourcing it (including goverment incentives). This isn't totally obvious though, and this doesn't rule out catastrophic bioterrorism (more like COVID scale than extinction scale) prior to AI powerful enough to substantially accelerate R&D across many sectors (including bio). It also doesn't rule out powerful AI being open sourced years after it is first created (though the world might be radically transformed by this point anyway). I don't have that much of an inside view on this, but reasonable people I talk to are skeptical that open source is a very big deal (in >20% of worlds) from at least an x-risk perspective. (Seems very sensitive to questions about government response, how much stuff is driven by ideology, and how much people end up being compelled (rightly or not) by "commoditize your complement" (and ecosystem) economic arguments.) Open source seems good on current margins, at least to the extent it doesn't leak algorithmic advances / similar.

2Charbel-Raphaël4mo

I would be happy to discuss in a dialogue about this. This seems to be an important topic, and I'm really unsure about many parameters here.

4WCargo4mo

I agree with claim 2-3 but not with claim 1 * I think « random physicist » is not super fair, it looks like from his stand point he indeed met physicist willing to do « alignment » research, and had backgrounds in research and developping theory * We didn’t find Phd student to work on alignment but also we didn’t try (at least not cesia / effisciences) * Its true that most of the people we find that wanted to work on the problem were the motivated ones, but from the point of view of the alignment problem still recruiting them could be a mistake (saturating the field etc)

2Charbel-Raphaël4mo

What do you think of my point about Scott Aaronson? Also, since you agree with points 2 and 3, it seems that you also think that the most useful work from last year didn't require advanced physics, so isn't this a contradiction with you disagreing with point 1?

[-]Alexander Gietelink Oldenziel4mo*507

hi John,

Let's talk about a hypothetical physicist-turned-alignment researcher who, (for no particular reason), we'll call John. This researcher needs to periodically repeat to himself that Only Physicists do Real Work, but also needs to write an Alignment Post-Mortem. Maybe he needs the Post-Mortem as a philosophical fig leaf to justify his own program lack of empirical grounding, or maybe he's honestly concerned but not very good at seeing the Planck in his own eye. Either way, he meets with two approaches to writing critiques - let's call them (again for no particular reason) "careful charitable engagement" and "provocative dismissal." Both can be effective, but they have very different impacts on community discourse. It turns out that careful engagement requires actually understanding what other researchers are doing, while provocative dismissal lets you write spicy posts from your armchair. Lo and behold, John endorses provocative dismissal as his Official Blogging Strategy, and continues writing critiques. (Actually the version which eventually emerges isn't even fully researched, it's a quite different version which just-so-happens to be even more dismissive of work he hasn't closely followed.)

Glad to hear you enjoyed ILIAD.

Best,

AGO

[-]Ryan Kidd4mo*4613

Alice is excited about the eliciting latent knowledge (ELK) doc, and spends a few months working on it. Bob is excited about debate, and spends a few months working on it. At the end of those few months, Alice has a much better understanding of how and why ELK is hard, has correctly realized that she has no traction on it at all, and pivots to working on technical governance. Bob, meanwhile, has some toy but tangible outputs, and feels like he's making progress.

I don't want to respond to the examples rather than the underlying argument, but it seems necessary here: this seems like a massively overconfident claim about ELK and debate that I don't believe is justified by popular theoretical worst-case objections. I think a common cause of "worlds where iterative design fails" is "iterative design seems hard and we stopped at the first apparent hurdle." Sure, in some worlds we can rule out entire classes of solutions via strong theoretical arguments (e.g., "no-go theorems"); but that is not the case here. If I felt confident that the theory-level objections to ELK and debate ruled out hodge-podge solutions, I would abandon hope in these research directions and drop them from the... (read more)

[-]Ryan Kidd4mo110

Some caveats:

A crucial part of the "hodge-podge alignment feedback loop" is "propose new candidate solutions, often grounded in theoretical models." I don't want to entirely focus on empirically fleshing out existing research directions to the exclusion of proposing new candidate directions. However, it seems that, often, new on-paradigm research directions emerge in the process of iterating on old ones!
"Playing theoretical builder-breaker" is an important skill and I think this should be taught more widely. "Iterators," as I conceive of them, are capable of playing this game well, in addition to empirically testing these theoretical insights against reality. John, to his credit, did a great job of emphasizing the importance of this skill with his MATS workshops on the alignment game tree and similar.
I don't want to entirely trust in alignments MVPs, so I strongly support empirical research that aims to show the failure modes of this meta-strategy. I additionally support the creation of novel strategic paradigms, though I think this is quite hard. IMO, our best paradigm-level insights as a field largely come from interdisciplinary knowledge transfer (e.g., from economics, game theo

... (read more)

4Noosphere894mo

Yeah, the worst-case ELK problem could well have no solution, but in practice alignment is solvable either by other methods or by having an ELK solution that does work on a large classes of AIs like neural nets, so Alice is plausibly making a big mistake, and a crux here is that I don't believe we will ever get no-go theorems, or even arguments to the standard level of rigor in physics because I believe alignment has pretty lax constraints, so a lot of solutions can appear. The relevant sentence below:

[-]Tahp4mo411

I am a physics PhD student. I study field theory. I have a list of projects I've thrown myself at with inadequate technical background (to start with) and figured out. I've convinced a bunch of people at a research institute that they should keep giving me money to solve physics problems. I've been following LessWrong with interest for years. I think that AI is going to kill us all, and would prefer to live for longer if I can pull it off. So what do I do to see if I have anything to contribute to alignment research? Maybe I'm flattering myself here, but I sound like I might be a person of interest for people who care about the pipeline. I don't feel like a great candidate because I don't have any concrete ideas for AI research topics to chase down, but it sure seems like I might start having ideas if I worked on the problem with somebody for a bit. I'm apparently very ok with being an underpaid gopher to someone with grand theoretical ambitions while I learn the material necessary to come up with my own ideas. My only lead to go on is "go look for something interesting in MATS and apply to it" but that sounds like a great way to end up doing streetlight research because I don't un... (read more)

[-]DusanDNesic4mo3629

It sounds like you should apply for the PIBBSS Fellowship! (https://pibbss.ai/fellowship/)

[-]Buck4mo228

Going to MATS is also an opportunity to learn a lot more about the space of AI safety research, e.g. considering the arguments for different research directions and learning about different opportunities to contribute. Even if the "streetlight research" project you do is kind of useless (entirely possible), doing MATS is plausibly a pretty good option.

[-]TsviBT4mo4440

MATS will push you to streetlight much more unless you have some special ability to have it not do that.

[-]Buck4mo1712

Do you mean during the program? Sure, maybe the only MATS offers you can get are for projects you think aren't useful--I think some MATS projects are pretty useless (e.g. our dear OP's). But it's still an opportunity to argue with other people about the problems in the field and see whether anyone has good justifications for their prioritization. And you can stop doing the streetlight stuff afterwards if you want to.

Remember that the top-level commenter here is currently a physicist, so it's not like the usefulness of their work would be going down by doing a useless MATS project :P

[-]TsviBT4mo2114

Remember that the top-level commenter here is currently a physicist, so it's not like the usefulness of their work would be going down by doing a useless MATS project :P

Yes it would! It would eat up motivation and energy and hope that they could have put towards actual research. And it would put them in a social context where they are pressured to orient themselves toward streetlighty research--not just during the program, but also afterward. Unless they have some special ability to have it not do that.

Without MATS: not currently doing anything directly useful (though maybe indirectly useful, e.g. gaining problem-solving skill). Could, if given $30k/year, start doing real AGI alignment thinking from scratch not from scratch, thereby scratching their "will you think in a way that unlocks understanding of strong minds" lottery ticket that each person gets.

With MATS: gotta apply to extension, write my LTFF grant. Which org should I apply to? Should I do linear probes software engineering? Or evals? Red teaming? CoT? Constitution? Hyperparamter gippity? Honeypot? Scaling supervision? Superalign, better than regular align? Detecting deception?

[-]Ryan Kidd4mo268

Obviously I disagree with Tsvi regarding the value of MATS to the proto-alignment researcher; I think being exposed to high quality mentorship and peer-sourced red-teaming of your research ideas is incredibly valuable for emerging researchers. However, he makes a good point: ideally, scholars shouldn't feel pushed to write highly competitive LTFF grant applications so soon into their research careers; there should be longer-term unconditional funding opportunities. I would love to unlock this so that a subset of scholars can explore diverse research directions for 1-2 years without 6-month grant timelines looming over them. Currently cooking something in this space.

6Quinn4mo

The upvotes and agree votes on this comment updated my perception of the rough consensus about mats and streetlighting. I previously would have expected less people to evaluate mats that way

[-]Ryan Kidd4mo1711

You could consider doing MATS as "I don't know what to do, so I'll try my hand at something a decent number of apparent experts consider worthwhile and meanwhile bootstrap a deep understanding of this subfield and a shallow understanding of a dozen other subfields pursued by my peers." This seems like a common MATS experience and I think this is a good thing.

[-]the gears to ascension4mo177

The first step would probably be to avoid letting the existing field influence you too much. Instead, consider from scratch what the problems of minds and AI are, how they relate to reality and to other problems, and try to grab them with intellectual tools you're familiar with. Talk to other physicists and try to get into exploratory conversation that does not rely on existing knowledge. If you look at the existing field, look at it like you're studying aliens anthropologically.

9yams4mo

[was a manager at MATS until recently and want to flesh out the thing Buck said a bit more] It’s common for researchers to switch subfields, and extremely common for MATS scholars to get work doing something different from what they did at MATS. (Kosoy has had scholars go on to ARC, Neel scholars have ended up in scalable oversight, Evan’s scholars have a massive spread in their trajectories; there are many more examples but it’s 3 AM.) Also I wouldn’t advise applying to something that seems interesting; I’d advise applying for literally everything (unless you know for sure you don’t want to work with Neel, since his app is very time intensive). The acceptance rate is ~4 percent, so better to maximize your odds (again, for most scholars, the bulk of the value is not in their specific research output over the 10 week period, but in having the experience at all). Also please see Ryan’s replies to Tsvi on the talent needs report for more notes on the street lighting concern as it pertains to MATS. There’s a pretty big back and forth there (I don’t cleanly agree with one side or the other, but it might be useful to you).

8plex4mo

If you're mobile (able to be in the UK) and willing to try a different lifestyle, consider going to the EA hotel aka CEEALAR, they offer free food and accomodation for a bunch of people, including many people working on AI safety. Alternatively, taking a quick look at https://www.aisafety.com/funders, the current best options are maybe LTFF, OpenPhil, CLR, or maybe AE Studios?

5Linda Linsefors4mo

Is it an option to keep your current job but but spend your research hours on AI Safety instead of quarks? Is this something that would appealing to you + acceptable to your employer? Given the current AI safety funding situation, I would strongly reccomend not giving up your current income. I think that a lot of the pressure towards street light research comes from the funding situation. The grants are short and to stay in the game you need visible results quickly. I think MATS could be good, if you can treat it as exploration, but not so good if you're in a hurry to get a job or a grant directly afterwards. Since MATS is 3 months of full time, it might not fit into your schedule (without quitting your job). Maybe instead try SPAR. Or see here for more options. Or you can skip the training program route, and just start reading on your own. There's lots and lots of AI safety reding lists. I reccomend this one for you. @Lucius Bushnaq who created and maintains it, also did quark physics, before switching to AI Safety. But if you don't like it, there are more options here under "Self studies". In general, the funding situation in AI safety is pretty shit right now, but other than that, there are so many resources to help people get started. It's just a matter of choosing where to start.

2Lucius Bushnaq4mo

(I have not maintained this list in many months, sorry.)

2Linda Linsefors4mo

That's still pretty good. Most reading lists are not updated at all after publication.

1Tahp4mo

My current job is only offered to me on the condition that I am doing physics research. I have some flexibility to do other things at the same time though. The insights and resources you list seem useful to me, so thank you.

4Linda Linsefors4mo

Ok, in that case I want to give you this post as inspiration. Changing the world through slack & hobbies — LessWrong

2cdt4mo

I am surprised that you find theoretical physics research less tight funding-wise than AI alignment [is this because the paths to funding in physics are well-worn, rather than better resourced?]. This whole post was a little discouraging. I hope that the research community can find a way forward.

[-]Ryan Kidd4mo374

On my understanding, EA student clubs at colleges/universities have been the main “top of funnel” for pulling people into alignment work during the past few years. The mix people going into those clubs is disproportionately STEM-focused undergrads, and looks pretty typical for STEM-focused undergrads. We’re talking about pretty standard STEM majors from pretty standard schools, neither the very high end nor the very low end of the skill spectrum.

At least from the MATS perspective, this seems quite wrong. Only ~20% of MATS scholars in the last ~4 programs have been undergrads. In the most recent application round, the dominant sources of applicants were, in order, personal recommendations, X posts, AI Safety Fundamentals courses, LessWrong, 80,000 Hours, then AI safety student groups. About half of accepted scholars tend to be students and the other half are working professionals.

[-]Thomas Kwa4mo*328

TLDR:

What OP calls "streetlighting", I call an efficient way to prioritize problems by tractability. This is only a problem insofar as we cannot also prioritize by relevance.
I think problematic streetlighting is largely due to incentives, not because people are not smart / technically skilled enough. Therefore solutions should fix incentives rather than just recruiting smarter people.

First, let me establish that theorists very often disagree on what the hard parts of the alignment problem are, precisely because not enough theoretical and empirical progress has been made to generate agreement on them. All the lists of "core hard problems" OP lists are different, and Paul Christiano famously wrote a 27-point list of disagreements on Eliezer's. This means that most people's views of the problem are wrong, and should they stick to their guns they might perseverate on either an irrelevant problem or a doomed approach.

I'd guess that historically perseveration has been an equally large problem as streetlighting among alignment researchers. Think of all the top alignment researchers in 2018 and all the agendas that haven't seen much progress. Decision theory should probably not take ~30% o... (read more)

[-]Stephen Fowler4mo329

Robin Hanson recently wrote about two dynamics that can emerge among individuals within an organisations when working as a group to reach decisions. These are the "outcome game" and the "consensus game."

In the outcome game, individuals aim to be seen as advocating for decisions that are later proven correct. In contrast, the consensus game focuses on advocating for decisions that are most immediately popular within the organization. When most participants play the consensus game, the quality of decision-making suffers.

The incentive structure within an organization influences which game people play. When feedback on decisions is immediate and substantial, individuals are more likely to engage in the outcome game. Hanson argues that capitalism's key strength is its ability to make outcome games more relevant.

However, if an organization is insulated from the consequences of its decisions or feedback is delayed, playing the consensus game becomes the best strategy for gaining resources and influence.

This dynamic is particularly relevant in the field of (existential) AI Safety, which needs to develop strategies to mitigate risks before AGI is developed. Currently, we have zero con... (read more)

[-]TsviBT4mo176

Currently, we have zero concrete feedback about which strategies can effectively align complex systems of equal or greater intelligence to humans.

Actually, I now suspect this is to a significant extent disinformation. You can tell when ideas make sense if you think hard about them. There's plenty of feedback, that's not already being taken advantage of, at the level of "abstract, high-level, philosophy of mind", about the questions of alignment.

4philh4mo

That's not really "concrete" feedback though, right? In the outcome game/consensus game dynamic Stephen's talking about, it seems hard to play an outcome game with that kind of feedback.

9TsviBT4mo

I'm not sure what "concrete" is supposed to mean; for the one or two senses I immediately imagine, no, I would say the feedback is indeed concrete. In terms of consensus/outcome, no, I think the feedback is actually concrete. There is a difficulty, which is that there's a much smaller set of people to whom the outcomes are visible. As an analogy/example: feedback in higher math. It's "nonconcrete" in that it's "just verbal arguments" (and translating those into something much more objective, like a computer proof, is a big separate long undertaking). And there's a much smaller set of people who can tell what statements are true in the domain. There might even be a bunch more people who have opinions, and can say vaguely related things that other non-experts can't distinguish from expert statements, and who therefore form an apparent consensus that's wrong + ungrounded. But one shouldn't conclude from those facts that math is less real, or less truthtracking, or less available for communities to learn about directly.

3stavros4mo

Thanks for linking this post. I think it has a nice harmony with Prestige vs Dominance status games. I agree that this is a dynamic that is strongly shaping AI Safety, but would specify that it's inherited from the non-profit space in general - EA originated with the claim that it could do outcome focused altruism, but.. there's still a lot of room for improvement, and I'm not even sure we're improving. The underlying dynamics and feedback loops are working against us, and I don't see evidence that core EA funders/orgs are doing more than pay lip service to this problem.

[-]Jozdien4mo2710

Thank you for writing this post. I'm probably slightly more optimistic than you on some of the streetlighting approaches, but I've also been extremely frustrated that we don't have anything better, when we could.

That means I'm going to have to speculate about how lots of researchers are being stupid internally, when those researchers themselves would probably say that they are not being stupid at all and I'm being totally unfair.

I've seen discussions from people who I vehemently disagreed that did similar things and felt very frustrated by not being able to defend my views with greater bandwidth. This isn't a criticism of this post - I think a non-zero number of those are plausibly good - but: I'd be happy to talk at length with anyone who feels like this post is unfair to them, about our respective views. I likely can't do as good a job as John can (not least because our models aren't identical), but I probably have more energy for talking to alignment researchers^[1].

On my understanding, EA student clubs at colleges/universities have been the main “top of funnel” for pulling people into alignment work during the past few years. The mix people going into those clubs is disproportio

... (read more)

8Quinn4mo

A few of the dopest people i know, who id love to have on the team, fall roughly into the category of "engaged and little with lesswrong, grok the core pset better than most 'highly involved' people, but are working on something irrelevant and not even trying cuz they think it seems too hard". They have some thoughtful p(doom), but assume they're powerless.

6Quinn4mo

Richard ngo tweeted recently that it was a mistake to design the agi safety fundamentals curriculum to be broadly accessible, that if he could do it over again thered be punishing problem sets that alienate most people

8_will_4mo

Any chance you have a link to this tweet? (I just tried control+f'ing through @Richard's tweets over the past 5 months, but couldn't find it.)

7Richard_Ngo4mo

FWIW twitter search is ridiculously bad, it's often better to use google instead. In this case I had it as the second result when I googled "richardmcngo twitter safety fundamentals" (richardmcngo being my twitter handle).

5Jozdien4mo

I believe this is the tweet.

2Quinn4mo

I tried a little myself too. Hope I didn't misremembering.

[-]TsviBT4mo261

Cf. https://www.lesswrong.com/posts/QzQQvGJYDeaDE4Cfg/talent-needs-of-technical-ai-safety-teams?commentId=BNkpTqwcgMjLhiC8L

https://www.lesswrong.com/posts/unCG3rhyMJpGJpoLd/koan-divining-alien-datastructures-from-ram-activations?commentId=apD6dek5zmjaqeoGD

https://www.lesswrong.com/posts/HbkNAyAoa4gCnuzwa/wei-dai-s-shortform?commentId=uMaQvtXErEqc67yLj

[-]Stephen McAleese4mo2111

Here's a Facebook post by Yann LeCun from 2017 which has a similar message to this post and seems quite insightful:

My take on Ali Rahimi's "Test of Time" award talk at NIPS.
Ali gave an entertaining and well-delivered talk. But I fundamentally disagree with the message.
The main message was, in essence, that the current practice in machine learning is akin to "alchemy" (his word).
It's insulting, yes. But never mind that: It's wrong!
Ali complained about the lack of (theoretical) understanding of many methods that are currently used in ML, particularly in deep learning.
Understanding (theoretical or otherwise) is a good thing. It's the very purpose of many of us in the NIPS community.
But another important goal is inventing new methods, new techniques, and yes, new tricks.
In the history of science and technology, the engineering artifacts have almost always preceded the theoretical understanding: the lens and the telescope preceded optics theory, the steam engine preceded thermodynamics, the airplane preceded flight aerodynamics, radio and data communication preceded information theory, the computer preceded computer science.
Why? Because theorists will spontaneously study "si

... (read more)

9Linda Linsefors4mo

I think Singular Learning Theory was developed independently of deep learning, and is not specifically about deep learning. It's about any learning system, under some assumptions, which are more general than the assumptions for normal Learning Theory. This is why you can use SLT but not normal Learning Theory when analysing NNs. NNs break the assumptions for normal Learning Theory but not for SLT.

[-]Nate Showell4mo*21-11

My view of the development of the field of AI alignment is pretty much the exact opposite of yours: theoretical agent foundations research, what you describe as research on the hard parts of the alignment problem, is a castle in the clouds. Only when alignment researchers started experimenting with real-world machine learning models did AI alignment become grounded in reality. The biggest epistemic failure in the history of the AI alignment community was waiting too long to make this transition.

Early arguments for the possibility of AI existential risk (as seen, for example, in the Sequences) were largely based on 1) rough analogies, especially to evolution, and 2) simplifying assumptions about the structure and properties of AGI. For example, agent foundations research sometimes assumes that AGI has infinite compute or that it has a strict boundary between its internal decision processes and the outside world.

As neural networks started to see increasing success at a wide variety of problems in the mid-2010s, it started to become apparent that the analogies and assumptions behind early AI x-risk cases didn't apply to them. The process of developing an ML model isn't very similar to... (read more)

[-]habryka4mo1812

Given that you speak with such great confidence that historical arguments for AI X-risk were not grounded, can you give me any "grounded" predictions about what superintelligent systems will do? (which I think we both agree is ultimately what will determine the fate of the world and universe)

If you make some concrete predictions then we can start arguing about the validity, but I find this kind of "mightier than thou" attitude where people keep making ill-defined statements like "these things are theoretical and don't apply", but without actually providing any answers to the crucial questions.

Indeed, not only that, I am confident that if you were to try to predict what will happen with superintelligence, you would very quickly start drawing on the obvious analogies to optimizers and dutch book arguments and evolution and goodhearts law, because we really don't have anything better.

4Nate Showell4mo

Some concrete predictions: * The behavior of the ASI will be a collection of heuristics that are activated in different contexts. * The ASI's software will not have any component that can be singled out as the utility function, although it may have a component that sets a reinforcement schedule. * The ASI will not wirehead. * The ASI's world-model won't have a single unambiguous self-versus-world boundary. The situational awareness of the ASI will have more in common with that of an advanced meditator than it does with that of an idealized game-theoretic agent.

[-]habryka4mo1516

I... am not very impressed by these predictions.

First, I don't think these are controversial predictions on LW (yes, a few people might disagree with him, but there is little boldness or disagreement with widely held beliefs in here), but most importantly, these predictions aren't about anything I care about. I don't care whether the world-model will have a single unambiguous self-versus-world boundary, I care whether the system is likely to convert the solar system into some form of computronium, or launch Dyson probes, or eliminate all potential threats and enemies, or whether the system will try to subvert attempts at controlling it, or whether it will try to amass large amounts of resources to achieve its aims, or be capable of causing large controlled effects via small information channels, or is capable of discovering new technologies with great offensive power.

The only bold prediction here is maybe "the behavior of the ASI will be a collection of heuristics", and indeed would take a bet against this. Systems under reflection and extensive self-improvement stop being well-described by contextual heuristics, and it's likely ASI will both self-reflect and self-improve (as... (read more)

1Signer4mo

The trend may be bounded, the trend may not go far by the time AI can invent nanotechnology - would be great if someone actually measured such things. And there being a trend at all is not predicted by utility-maximization frame, right?

4Leon Lang4mo

“heuristics activated in different contexts” is a very broad prediction. If “heuristics” include reasoning heuristics, then this probably includes highly goal-oriented agents like Hitler. Also, some heuristics will be more powerful and/or more goal-directed, and those might try to preserve themselves (or sufficiently similar processes) more so than the shallow heuristics. Thus, I think eventually, it is plausible that a superintelligence looks increasingly like a goal-maximizer.

[-]habryka4mo*1234

For example, agent foundations research sometimes assumes that AGI has infinite compute or that it has a strict boundary between its internal decision processes and the outside world.

It's one of the most standard results in ML that neural nets are universal function approximators. In the context of that proof, ML de-facto also assumes that you have infinite computing power. It's just a standard tool in ML, AI or CS to see what models predict when you take them to infinity. Indeed, it's really one of the most standard tools in the modern math toolbox, used by every STEM discipline I can think of.

Similarly, separating the boundary between internal decision processes and the outside world continues to be a standard assumption in ML. It's really hard to avoid, everything gets very loopy and tricky, and yes, we have to deal with that loopiness and trickiness, but if anything, agent foundations people were the actual people trying to figure out how to handle that loopiness and trickiness, whereas the ML community really has done very little to handle it. In contrary to your statement here, people on LW have been for years pointing out how embedded agency is really important, and been dismissed by active practitioners because they think the cartesian boundary here is just fine for "real" and "grounded" applications like "predicting the next token" which clearly don't have relevance to these weird and crazy scenarios about power-seeking AIs developing contextual awareness that you are talking about.

3RHollerith4mo

You do realize that by "alignment", the OP (John) is not talking about techniques that prevent an AI that is less generally capable than a capable person from insulting the user or expressing racist sentiments? We seek a methodology for constructing an AI that either ensures that the AI turns out not to be able to easily outsmart us or (if it does turn out to be able to easily outsmart us) ensures (or makes it unlikely) that it won't kill us all or do something other terrible thing. (The former is not researched much compared to the latter, but I felt the need to include it for completeness.) The way it is now, it is not even clear whether you and the OP (John) are talking about the same thing (because "alignment" has come to have a broad meaning). If you want to continue the conversation, it would help to know whether you see a pressing need for a methodology of the type I describe above. (Many AI researchers do not: they think that outcomes like human extinction are quite unlikely or at least easy to avoid.)

[-]Zach Stein-Perlman4mo201

This post starts from the observation that streetlighting has mostly won the memetic competition for alignment as a research field, and we'll mostly take that claim as given. Lots of people will disagree with that claim, and convincing them is not a goal of this post.

Yep. This post is not for me but I'll say a thing that annoyed me anyway:

... and Carol's thoughts run into a blank wall. In the first few seconds, she sees no toeholds, not even a starting point. And so she reflexively flinches away from that problem, and turns back to some easier problems.

Does this actually happen? (Even if you want to be maximally cynical, I claim presenting novel important difficulties (e.g. "sensor tampering") or giving novel arguments that problems are difficult is socially rewarded.)

[-]TsviBT4mo1714

Does this actually happen?

Yes, absolutely. Five years ago, people were more honest about it, saying ~explicitly and out loud "ah, the real problems are too difficult; and I must eat and have friends; so I will work on something else, and see if I can get funding on the basis that it's vaguely related to AI and safety".

[-]stavros4mo1011

To the extent that anecdata is meaningful:

I have met somewhere between 100-200 AI Safety people in the past ~2 years; people for whom AI Safety is their 'main thing'.

The vast majority of them are doing tractable/legible/comfortable things. Most are surprisingly naive; have less awareness of the space than I do (and I'm just a generalist lurker who finds this stuff interesting; not actively working on the problem).

Few are actually staring into the void of the hard problems; where hard here is loosely defined as 'unknown unknowns, here be dragons, where do I even start'.

Fewer still progress from staring into the void to actually trying things.

I think some amount of this is natural and to be expected; I think even in an ideal world we probably still have a similar breakdown - a majority who aren't contributing (yet)^[1], a minority who are - and I think the difference is more in the size of those groups.

I think it's reasonable to aim for a larger, higher quality, minority; I think it's tractable to achieve progress through mindfully shaping the funding landscape.

^{^}
Think it's worth mentioning that all newbies are useless, and not all newbies remain newbies. Some portion of the majority a

... (read more)

6TsviBT4mo

This isn't clear to me, where the crux (though maybe it shouldn't be) is "is it feasible for any substantial funders to distinguish actually-trying research from other".

4Zach Stein-Perlman4mo

Yeah, I agree sometimes people decide to work on problems largely because they're tractable [edit: or because they’re good for safety getting alignment research or other good work out of early AGIs]. I'm unconvinced of the flinching away or dishonest characterization.

6TsviBT4mo

Do you think that funders are aware that >90% [citation needed!] of the money they give to people, to do work described as helping with "how to make world-as-we-know-it ending AGI without it killing everyone", is going to people who don't even themselves seriously claim to be doing research that would plausibly help with that goal? If they are aware of that, why would they do that? If they aren't aware of it, don't you think that it should at least be among your very top hypotheses, that those researchers are behaving materially deceptively, one way or another, call it what you will?

6Zach Stein-Perlman4mo

I do not. On the contrary, I think ~all of the "alignment researchers" I know claim to be working on the big problem, and I think ~90% of them are indeed doing work that looks good in terms of the big problem. (Researchers I don't know are likely substantially worse but not a ton.) In particular I think all of the alignment-orgs-I'm-socially-close-to do work that looks good in terms of the big problem: Redwood, METR, ARC. And I think the other well-known orgs are also good. This doesn't feel odd: these people are smart and actually care about the big problem; if their work was in the even if this succeeds it obviously wouldn't be helpful category they'd want to know (and, given the "obviously," would figure that out). ---------------------------------------- Possibly the situation is very different in academia or MATS-land; for now I'm just talking about the people around me.

4Zach Stein-Perlman4mo

I feel like John's view entails that he would be able to convince my friends that various-research-agendas-my-friends-like are doomed. (And I'm pretty sure that's false.) I assume John doesn't believe that, and I wonder why he doesn't think his view entails it.

5johnswentworth4mo

From the post:

5Zach Stein-Perlman4mo

Yeah. I agree/concede that you can explain why you can't convince people that their own work is useless. But if you're positing that the flinchers flinch away from valid arguments about each category of useless work, that seems surprising.

[-]TsviBT4mo204

The flinches aren't structureless particulars. Rather, they involve warping various perceptions. Those warped perceptions generalize a lot, causing other flaws to be hidden.

As a toy example, you could imagine someone attached to the idea of AI boxing. At first they say it's impossible to break out / trick you / know about the world / whatever. Then you convince them otherwise--that the AI can do RSI internally, and superhumanly solve computer hacking / protein folding / persuasion / etc. But they are attached to AI boxing. So they warp their perception, clamping "can an AI be very superhumanly capable" to "no". That clamping causes them to also not see the flaws in the plan "we'll deploy our AIs in a staged manner, see how they behave, and then recall them if they behave poorly", because they don't think RSI is feasible, they don't think extreme persuasion is feasible, etc.

A more real example is, say, people thinking of "structures for decision making", e.g. constitutions. You explain that these things, they are not reflectively stable. And now this person can't understand reflective stability in general, so they don't understand why steering vectors won't work, or why lesioning wo... (read more)

9johnswentworth4mo

My impression, from conversations with many people, is that the claim which gets clamped to True is not "this research direction will/can solve alignment" but instead "my research is high value". So when I've explained to someone why their current direction is utterly insufficient, they usually won't deny some class of problems. They'll instead tell me that the research still seems valuable even though it isn't addressing a bottleneck, or that their research is maybe a useful part of a bigger solution which involves many other parts, or that their research is maybe useful step toward something better. (Though admittedly I usually try to "meet people where they're at", by presenting failure-modes which won't parse as weird to them. If you're just directly explaining e.g. dangers of internal RSI, I can see where people might instead just assume away internal RSI or some such.) ... and then if I were really putting in effort, I'd need to explain that e.g. being a useful part of a bigger solution (which they don't know the details of) is itself a rather difficult design constraint which they have not at all done the work to satisfy. But usually I wrap up the discussion well before that point; I generally expect that at most one big takeaway from a discussion can stick, and if they already have one then I don't want to overdo it.

8TsviBT4mo

This agrees with something like half of my experience. Right, I think of this response as arguing that streetlighting is a good way to do large-scale pre-paradigm science projects in general. And I have to somewhat agree with that. Then I argue that AGI alignment is somewhat exceptional: 1. cruel deadline, 2. requires understanding as-yet-unconceived aspects of Mind. Point 2 of exceptionality goes through things like alienness of creativity, RSI, reflective instability, the fact that we don't understand how values sit in a mind, etc., and that's the part that gets warped away. I do genuinely think that the 2024 field of AI alignment would eventually solve the real problems via collective iterative streetlighting. (I even think it would eventually solve it in a hypothetical world where all our computers disappeared, if it kept trying.) I just think it'll take a really long time. Right, exactly. (I wrote about this in my opaque gibberish philosophically precise style here: https://tsvibt.blogspot.com/2023/09/a-hermeneutic-net-for-agency.html#1-summary)

4Zach Stein-Perlman4mo

I wonder whether John believes that well-liked research, e.g. Fabien's list, is actually not valuable or rare exceptions coming from a small subset of the "alignment research" field.

8Buck4mo

I strongly suspect he thinks most of it is not valuable

7johnswentworth4mo

This is the sort of object-level discussion I don't want on this post, but I've left a comment on Fabien's list.

[-]debrevitatevitae4mo*194

Some broad points:

My interpretation of those at fault are the field builders and funders. That is part of the reason I quit doing alignment. The entire funding landscape feels incredibly bait and switch: come work for us! We are desperate for talent! The alignment problem is the hardest issue of the century! (Cue 2 years and an SBF later) Erm, no, we don't fund AI safety startups or interp, and we want to see tangible results in a few narrow domains...
In particular, I advocate for the concept of endorsing work with a nonlinear apparent progress rate. Call it 'slow work' or something. Often a lot of hard things look like they're getting nowhere but all the small failures add up to something big. This is also why I do not recommend MATS as a one size fits all solution for people joining the alignment field: some people do better with slow work, and with carefully thinking about where they are heading with direction and intent, not just putting their heads down to 'get something done'. In fact, this mindset gave me burnout earlier this year.
The people not at fault are those who are middle of the pack undergrads or not Physicists Doing Real Things. This is a system wide problem.

I... (read more)

[-]bilalchughtai4mo*1212

In fact, this mindset gave me burnout earlier this year.

I relate pretty strongly to this. I think almost all junior researchers are incentivised to 'paper grind' for longer than is correct. I do think there are pretty strong returns to having one good paper for credibility reasons; it signals that you are capable of doing AI safety research, and thus makes it easier to apply for subsequent opportunities.

Over the past 6 months I've dropped the paper grind mindset and am much happier for this. Notably, were it not for short term grants where needing to visibly make progress is important, I would have made this update sooner. Another take that I have is that if you have the flexibility to do so (e.g. by already having stable funding, perhaps via being a PhD student), front-loading learning seems good. See here for a related take by Rohin. Making progress on hard problems requires understanding things deeply, in a way which making progress on easy problems that you could complete during e.g. MATS might not.

[-]Seth Herd4mo1912

I think this lens of incentives and the "flinching away" concept are extremely valuable for understanding the field of alignment (and less importantly, everything else:).

I believe "flinching away" is the psychological tendency that creates bigger and more obvious-on-inspection "ugh fields". I think this is the same underlying mechanism discussed as valence by Steve Byrnes. Motivated reasoning is the name for the resulting cognitive bias. Motivated reasoning overlaps by experimental definition with confirmation bias, the one bias destroying society in Scott Alexander's terms. After studying cognitive biases through the lens of neuroscience for years, I th nk motivated reasoning is severely hampering progress in alignment, as it is every other project. I have written about it a little in what is the most important cognitive bias to understand, but I want to address more thoroughly how it impacts alignment research.

This post makes a great start at addressing how that's happening.

I very much agree with the analysis of incentives given here: they are strongly toward tangible and demonstrable progress in any direction vaguely related to the actual problem at hand.... (read more)

[-]Thane Ruthenis4mo169

I almost want to say that it sounds like we should recruit people from the same demographic as good startup founders. Almost.

Per @aysja's list, we want creative people with an unusually good ability to keep themselves on-track, who can fluently reason at several levels of abstraction, and who don't believe in the EMH. This fits pretty well with the stereotype of a successful technical startup founder – an independent vision, an ability to think technically and translate that technical vision into a product customers would want (i. e., develop novel theory and carry it across the theory-practice gap), high resilience in the face of adversity, high agency, willingness to believe you can spot an exploitable pattern where no-one did, etc.

... Or, at least, that is the stereotype of a successful startup founder from Paul Graham's essays. I expect that this idealized image diverges from reality in quite a few ways. (I haven't been following Silicon Valley a lot, but from what I've seen, I've not been impressed with all the LLM and LLM-wrapper startups. Which made me develop quite a dim image of what a median startup actually looks like.)

Still, when picking whom to recruit, it might be use... (read more)

[-]Noosphere894mo163

One particular way this issue could be ameliorated is by encouraging people to write up null results/negative results, and one part of your model here is that a null result doesn't get reported and thus other people don't hear about failure, while people do hear about success stories, meaning that there is a selection effect to work on successful programs, and no one hears about the failures to tackle the problem, which is bad for research culture, and negative results not being shown is a universal problem across fields.

[-]Seth Herd4mo108

Definitely.

Lack of publicly reporting null results was a subtle but huge problem in cognitive neuroscience. It took a while to figure out just how much effort was being wasted running studies that others had already tried and not reported because results were null.

Alignment doesn't have the same journal gatekeeping system that filters out null results, but there's probably a pretty strong tendency to report less on lack of progress than actual progress.

So post about it if you worked hard at something and got nowhere. This is valuable information when others choose their problems and approaches.

I do see people doing this; it would probably be valuable if we did it more.

[-]plex4mo140

I mostly agree with the diagnosis of the problem, but have some different guesses about paths to try and get alignment on track.

I think the core difficulties of alignment are explained semi-acceptably, but in a scattered form which means that only the dedicated explorers with lots of time and good taste end up finding them. Having a high quality course which collects the best explainers we have to prepare people for trying to find a toehold, and noticing the gaps left and writing good things added to fill them, seems necessary for any additional group of people added to actually point in the right direction.

BlueDot's course seems strongly optimized to funnel people into the empirical/ML/lab alignment team pipeline, they have dropped the Agent Foundations module entirely, and their "What makes aligning AI difficult?" fast track is 3/5ths articles on RLHF/RLAIF (plus an intro to LLMs and a RA video). This is the standard recommendation, and there isn't a generally known alternative.

I tried to fix this with Agent Foundations for Superintelligent Robust-Alignment, but I think this would go a lot better if someone like @johnswentworth took it over and polished it.

[-]jacquesthibs4mo110

Putting venues aside, I'd like to build software (like AI-aided) to make it easier for the physics post-docs to onboard to the field and focus on the 'core problems' in ways that prevent recoil as much as possible. One worry I have with 'automated alignment'-type things is that it similarly succumbs to the streetlight effect due to models and researchers having biases towards the types of problems you mention. By default, the models will also likely just be much better at prosaic-style safety than they will be at the 'core problems'. I would like to instead design software that makes it easier to direct their cognitive labour towards the core problems.

I have many thoughts/ideas about this, but I was wondering if anything comes to mind for you beyond 'dedicated venues' and maybe writing about it.

[-]dr_s4mo87

Generalize this story across a whole field, and we end up with most of the field focused on things which are easy, regardless of whether those things are valuable.

I would say this problem plagues more than just alignment, it plagues all of science. Trying to do everything as a series of individual uncoordinated contributions with an authority on top acting only to filter based on approximate performance metrics has this effect.

[-]abramdemski4mo80

(And no, a doctorate degree in almost any other technical field, including ML these days, does not convey a comparable level of general technical skill to a physics PhD.)

Mathematics?

4johnswentworth4mo

High variance. A lot of mathematics programs allow one to specialize in fairly narrow subjects IIUC, which does not convey a lot of general technical skill. I'm sure there are some physics programs which are relatively narrow, but my impression is that physics programs typically force one to cover a pretty wide volume of material.

[-]Chris_Leong4mo8-1

If you wanted to create such a community, you could try spinning up a Discord server?

9Stephen Fowler4mo

I'm not saying that this would necessarily be a step in the wrong direction, but I don't think think a discord server is capable of fixing a deeply entrenched cultural problem among safety researchers. If moderating the server takes up a few hours of John's time per week the opportunity cost probably isn't worth it.

4Caleb Biddulph4mo

Maybe someone else could moderate it?

[-]Purplehermann4mo80

A few thoughts.

Have you checked what happens when you throw physic postdocs at the core issues - do they actually get traction or just stare at the sheer cliff for longer while thinking? Did anything come out of the Illiad meeting half a year later? Is there a reason that more standard STEMs aren't given an intro into some of the routes currently thought possibly workable, so they can feel some traction? I think either could be true- that intelligence and skills aren't actually useful right now, the problem is not tractable, or better onboarding could

... (read more)

[-]Quinn4mo73

As someone who, isolated and unfunded, went on months-long excursions into the hard version of the pset multiple times and burned out each time, I felt extremely validated when you verbally told me a fragment of this post around a fire pit at illiad. The incentives section of this post is very grim, but very true. I know naive patches to the funding ecosystem would also be bad (easy for grifters, etc), but I feel very much like I and we were failed by funders. I could've been stronger etc, I could've been in berkeley during my attempts instead of philly, b... (read more)

[-]Johannes C. Mayer4mo6-1

... and Carol's thoughts run into a blank wall. In the first few seconds, she sees no toeholds, not even a starting point. And so she reflexively flinches away from that problem, and turns back to some easier problems.

I spend ~10 hours trying to teach people how to think. I sometimes try to intentionally cause this to happen. Usually you can recognize it by them starting to be quiet (I usually give the instruction that they should do all their thinking out loud). And this seems to be when actual cognitive labor is happening, instead of saying things that y... (read more)

[-]RogerDearnaley3mo50

I have a Theoretical Physics PhD in String Field Theory from Cambridge — my reaction to hard problems is to try to find a way of cracking them that no-one else is trying. Please feel free to offer to fund me :-)

[-]Christopher King4mo50

I think there is an obvious signal that could be used: a forecast of how much MIRI will like the research when asked in 5 years. (Note that I don't mean just asking MIRI now, but rather something like prediction markets or super-forecasters to predict what MIRI will say 5 years from now.)

Basically, if the forecast is above average, anyone who trusts MIRI should fund them.

9Ryan Kidd4mo

But who is "MIRI"? Most of the old guard have left. Do you mean Eliezer and Nate? Or a consensus vote of the entire staff (now mostly tech gov researchers and comms staff)?

6Thane Ruthenis4mo

Hm. Eliezer has frequently complained that the field has no recognition function for good research he's satisfied with besides "he personally looks at the research and passes his judgement", and that this obviously doesn't scale. Stupid idea: Set up a grantmaker that funds proposals based on a prediction market tasked with evaluating how likely Eliezer/Nate/John is to approve of a given research project. Each round, after the funding is assigned to the highest-credence projects, Eliezer/Nate/John evaluate a random subset of proposals to provide a ground-truth signal; the corresponding prediction markets pay out, the others resolve N/A. This should effectively train a reward function that emulates the judges' judgements in a scalable way. Is there an obvious reason this doesn't work? (One possible issue is the amount of capital that'd need to be frozen in those markets by market participants, but we can e. g. upscale the effective amounts of money each participant has as some multiple of the actual dollars invested, based on how many of their bets are likely to actually pay out.)

[-]ryan_greenblatt4mo166

Some notes:

I don't think this is the actual bottleneck here. Noteably, Eliezer, Nate, and John don't spend much of any of their time assessing research at all (at least recently) as far as I can tell.
I don't think a public market will add much information. Probably better to just have grantmakers with more context forecast and see how well they do. You need faster feedback loops than 1 yr to get anywhere though, but you can do this by practicing on a bunch of already done research.
My current view is that more of the bottleneck in grantmaking is not having good stuff to fund rather than grantmakers not funding stuff, though I do still think the Open Phil should fund notably more aggressively than they currently do, that marginal LTFF dollars look great, and that it's bad that Open Phil was substantially restricted in what they can fund recently (which I expect to have substantial chilling effects in addition to those areas).

9Thane Ruthenis4mo

* Perhaps not specific research projects, but they've communicated a lot regarding their models of what types of research are good/bad. (See e. g. Eliezer's list of lethalities, John's Why Not Just... sequence, this post of Nate's.) * I would assume this is because this doesn't scale and their reviews are not, in any given instance, the ultimate deciding factor regarding what people do or what gets funded. Spending time evaluating specific research proposals is therefore cost-inefficient compared to reviewing general research trends/themes. Because no entity that I know of is currently explicitly asking for proposals that Eliezer/Nate/John would fund. Why would people bother coming up with such proposals in these circumstances? The system explicitly doesn't select for it. I expect that if there were an actual explicit financial pressure to goodhart to their preferences, much more research proposals that successfully do so would be around.

[-]catubc3mo30

Exactly, and thanks for writing this.

I would go further and say that - AI safety is AI dev - and this happened years ago. If we stopped it all now, we'd extend our timelines:

https://www.lesswrong.com/posts/vkzmbf4Mve4GNyJaF/the-case-for-stopping-ai-safety-research

[-]Lorec4mo30

Last big piece: if one were to recruit a bunch of physicists to work on alignment, I think it would be useful for them to form a community mostly-separate from the current field. They need a memetic environment which will amplify progress on core hard problems, rather than... well, all the stuff that's currently amplified.

Yes, exactly. Unfortunately, actually doing this is impossible, so we all have to keep beating our heads against a wall just the same.

[-]Daniel Tan4mo30

To all those researchers: yup, from your perspective I am in fact being very unfair, and I'm sorry. You are not the intended audience of this post

I'm curious, what's the distinction between "those researchers" and the actual intended audience of your post? Who is this meant to convince, if not the people you claim are being 'stupid internally'?

5johnswentworth4mo

This post isn't intended to convince anyone at all that people are in fact streetlighting. This post is intended to present my own models and best guesses at what to do about it to people who are already convinced that most researchers in the field are streetlighting. They are the audience.

[-]Oliver Daniels4mo3-2

Yeah it does seem unfortunate that there's not a robust pipeline for tackling the "hard problem" (even conditioned to more "moderate" models of x-risk)

But (conditioned on "moderate" models) there's still a log of low-hanging fruit that STEM people from average universities (a group I count myself among) can pick. Like it seems good for Alice to bounce off of ELK and work on technical governance, and for Bob to make incremental progress on debate. The current pipeline/incentive system is still valuable, even if it systematically neglects tackling the "hard problem of alignment".

[-]Logan Zoellner4mo2-4

A policeman sees a drunk man searching for something under a streetlight and asks what the drunk has lost. He says he lost his keys and they both look under the streetlight together. After a few minutes the policeman asks if he is sure he lost them here, and the drunk replies, no, and that he lost them in the park. The policeman asks why he is searching here, and the drunk replies, "this is where the light is".

I've always been sympathetic to the drunk in this story. If the key is in the light, there is a chance of finding it. If it is in ... (read more)

[-]Joel Burget4mo*21

A different way to think about types of work is within current ML paradigms vs outside of them. If you believe that timelines are short (e.g. 5 years or less), it makes much more sense to work within current paradigms, otherwise there's very little chance your work will become adopted in time to matter. Mainstream AI, with all of its momentum, is not going to adopt a new paradigm overnight.

If I understand you correctly, there's a close (but not exact) correspondence between work I'd label in-paradigm and work you'd label as "streetlighting". On my model th... (read more)

[-]Signer4mo1-1

But it is important, and this post just isn’t going to get done any other way.

Speaking about streetlighting...

Moderation Log