Eliezer Yudkowsky periodically complains about people coming up with questionable plans with questionable assumptions to deal with AI, and then either:

  • Saying "well, if this assumption doesn't hold, we're doomed, so we might as well assume it's true."
  • Worse: coming up with cope-y reasons to assume that the assumption isn't even questionable at all. It's just a pretty reasonable worldview.

Sometimes the questionable plan is "an alignment scheme, which Eliezer thinks avoids the hard part of the problem." Sometimes it's a sketchy reckless plan that's probably going to blow up and make things worse.

Some people complain about Eliezer being a doomy Negative Nancy who's overly pessimistic.

I had some interesting experiences a few months ago when I ran beta-tests of my Planmaking and Surprise Anticipation workshop, which I think are illustrative.


i. Slipping into a more Convenient World

I have an exercise where I give people the instruction to play a puzzle game ("Baba Is You"), but where, instead of the usual ability to move around and interact with the world to experiment and learn things, you need to make a complete plan for solving the level and aim to get it right on your first try.

In the exercise, I have people write down the steps of their plan, and assign a probability to each step. 

If there is a part of the puzzle-map that you aren't familiar with, you'll have to make guesses. I recommend making 2-3 guesses for how a new mechanic might work. (I don't recommend making a massive branching tree for every possible eventuality; for the sake of the exercise not taking forever, I suggest sticking to 2-3 branching-path plans.)
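(A side note on how those per-step numbers combine, under the simplifying assumption that the steps are roughly independent: the plan only succeeds if every step does, so the step probabilities multiply:

$$P(\text{plan succeeds}) \approx \prod_{i=1}^{n} p_i, \qquad \text{e.g. } 0.9^5 \approx 0.59$$

Five steps you're "90% sure of" each already leave the overall plan at about 59%, which is part of why I ask for explicit numbers.)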

Several months ago, I had three young-ish alignment researchers do this task (each session was a 1-1 with just me and them).

The participants varied in how experienced they were with Baba Is You. Two of them were new to the game, and completed the first couple levels without too much difficulty, and then got to a harder level. The third participant had played a bit of the game before, and started with a level near where they had left off.

Each of them looked at their level for a while and said "Well, this looks basically impossible... unless this [questionable assumption I came up with that I don't really believe in] is true. I think that assumption is... 70% likely to be true."

Then they went and executed their plan.

It failed. The questionable assumption was not true.

Then, each of them said again: "okay, well here's a different sketchy assumption that I wouldn't have thought was likely, except if it's not true, the level seems unsolvable."

I asked "what's your probability for that one being true?"

"70%"

"Okay. You ready to go ahead again?" I asked.

"Yep", they said.

They tried again. The plan failed again.

And, then they did it a third time, still saying ~70%.

This happened with three different junior alignment researchers, making a total of 9 predictions, which were wrong 100% of the time.

(The third guy, on the second or third time, said "well... okay, I was wrong last time. So this time let's say it's... 60%.")
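(To put those numbers in perspective, treating the nine predictions as roughly independent: if "70%" had been calibrated, the chance of all nine plans failing would have been about

$$0.3^9 \approx 2 \times 10^{-5}$$

i.e. roughly one in fifty thousand.)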


My girlfriend ran a similar exercise with another group of young smart people, with similar results. "I'm 90% sure this is going to work" ... "okay that didn't work."


Later I ran the exercise again, this time with a mix of younger and more experienced AI safety folk, several of whom leaned more pessimistic. I think the group overall did better. 

One of them actually made the correct plan on the first try. 

One of them got it wrong, but gave an appropriately low estimate for themselves.

Another of them (call them Bob) made three attempts, and gave themselves ~50% odds on each attempt. They went into the experience thinking "I expect this to be hard but doable, and I believe in developing the skill of thinking ahead like this." 

But, after each attempt, Bob was surprised by how out-of-left-field their errors were. They'd predicted they'd be surprised... but they were surprised in surprising ways – even in a simplified, toy domain that was optimized for being a solvable puzzle, where they had lots of time to think through everything. They came away feeling a bit shaken up by the experience, not sure if they believed in longterm planning at all, and a bit alarmed at how many people around them confidently talked as if they were able to model things multiple steps out.


ii. Finding traction in the wrong direction.

A related (though distinct) phenomenon I found, in my own personal experiments using Baba Is You, Thinking Physics, or other puzzle exercises as rationality training:

It's very easy to spend a lot of time optimizing within the areas I feel some traction, and then eventually realize this was wasted effort. A few different examples:

Forward Chaining instead of Back Chaining. 

In Baba Is You levels, there will often be parts of the world that are easy to start fiddling around with and manipulating, maneuvering into a position that looks like it'll help you navigate the world. But, often, these parts are red herrings. They open up your option-space within the level... but not the parts you needed to win.

It's often faster to find the ultimately right solution if you're starting from the end and backchaining, rather than forward chaining with whatever bits are easiest to fiddle around with.

Moving linearly, when you needed to be exponential. 

Often in games I'll be making choices that improve my position locally, and clearly count as some degree of "progress." I'll get 10 extra units of production, or damage. But, then I reach the next stage, and it turns out I really needed 100 extra units to survive. And the thought-patterns that would have been necessary to "figure out how to get 100 units" on my first playthrough are pretty different from the ones I was actually doing.

It should have occurred to me to ask "will the game ever throw a bigger spike in difficulty at me?", and "is my current strategy of tinkering around going to prepare me for such difficulty?".

Doing lots of traction-y-feeling reasoning that just didn't work. 

On my first Thinking Physics problem last year, I brainstormed multiple approaches to solving the problem, and tried each of them. I reflected on considerations I might have missed, and then incorporated them. I made models and estimations. It felt very productive and reasonable.

I got the wrong answer, though.

My study partner did get the right answer. Their method was more oriented around thought experiments. And in retrospect their approach seemed more useful for this sort of problem. And it's noteworthy that my subjective feeling of "making progress" didn't actually correspond to making the sort of progress that mattered.


Takeaways

Obviously, an artificial puzzle is not the same as a real, longterm research project. Some differences include:

  • It's designed to be solvable
  • But, also, it's designed to be sort of counterintuitive and weird
  • It gives you a fairly constrained world, and tells you what sort of questions you're trying to ask.
  • It gives you clear feedback when you're done.

Those elements push in different directions. Puzzles are more deliberately counterintuitive than reality is, on average, so it's not necessarily "fair" when you fall for a red herring. But they are nonetheless mostly easier and clearer than real science problems.

What I found most interesting was people literally saying the words out loud, multiple times: "Well, if this [assumption] isn't true, then this is impossible" (often explicitly adding "I wouldn't [normally] think this was that likely... but..."). And, then making the mental leap all the way towards "70% that this assumption is true." Low enough for some plausible deniability, high enough to justify giving their plan a reasonable likelihood of success.

It was a much clearer instance of mentally slipping sideways into a more convenient world than I'd have expected to get.

I don't know if the original three people had done calibration training of any kind beforehand.  I know my own experience doing the OpenPhil calibration game was that I got good at it within a couple hours... but that it didn't transfer very well to when I started making PredictionBook / Fatebook questions about topics I actually cared about.

I expect forming hypotheses in a puzzle game to be harder than the OpenPhil Calibration game, but easier than making longterm research plans. It requires effort to wrangle your research plans into a bet-able form, and then actually make predictions about it. I bet most people do not do that. 

Now, I do predict that people who do real research in a given field will get at least decent at implicitly predicting research directions within their field (via lots of trial-and-error, and learning from mentors). This is what "research taste" is. But, I don't think this is that reliable if you're not deliberately training your calibration. (I have decades of experience passively predicting stuff happening in my life, but I nonetheless was still miscalibrated when I first started making explicit PredictionBook predictions about them). 

And moreover, I don't think this transfers much to new fields you haven't yet mastered. Stereotypes come to mind of brilliant physicists who assume their spherical-cow-simplifications will help them model other fields. 

This seems particularly important for existentially-relevant alignment research. We have examples of people who have demonstrated "some kind of traction and results" (for example, doing experiments on modern ML systems. Or, for that matter, coming up with interesting ideas like Logical Induction). But we don't actually have direct evidence that this productivity will be relevant to superintelligent agentic AI. 

When it comes to "what is good existential safety research taste?", I think we are guessing.

I think you should be scared about this, if you're the sort of theoretical researcher who's trying to cut at the hardest parts of the alignment problem (whose feedback loops are weak or nonexistent).

I think you should be scared about this, if you're the sort of Prosaic ML researcher who does have a bunch of tempting feedback loops for current generation ML, but a) it's really not clear whether or how those apply to aligning superintelligent agents, b) many of those feedback loops also basically translate into enhancing AI capabilities and moving us toward a more dangerous world.

I think you should be scared about this, if you're working in policy, either as a research wonk or an advocate, where there are some levers of power you can sort-of-see, but how the levers fit together and whether they actually connect to longterm existential safety is unclear.

Unfortunately, "be scared" isn't that useful advice. I don't have a great prescription for what to do.

My dissatisfaction with this situation is what leads me to explore Feedbackloop-first Rationality, basically saying "Well the problem is our feedback loops suck – either they don't exist, or they are temptingly goodharty. Let's try to invent better ones." But I haven't yet achieved an outcome here I can point to and say "okay this clearly helps."

But, meanwhile, my own best guess is:

I feel a lot more hopeful about researchers who have worked on a few different types of problems, and gotten more calibrated on where the edges of their intuitions' usefulness are. I'm exploring the art of operationalizing cruxy predictions, because I hope that can eventually feed into the art of having calibrated, cross-domain research taste, if you are deliberately attempting to test your transfer learning.

I feel more hopeful about researchers who make lists of their foundational assumptions, and practice staring into the abyss, confronting "what would I actually do if my core assumptions were wrong, and my plan doesn't work?", and grieving for assumptions that seem, on reflection, to have been illusions.

I feel more hopeful about researchers who talk to mentors with different viewpoints, learning different bits of taste and hard-earned life lessons, and attempt to integrate them into some kind of holistic AI safety research taste.

And while I don't think it's necessarily right for everyone to set themselves the standard of "tackle the hardest steps in the alignment problem and solve it in one go", I feel much more optimistic about people who have thought through "what are all the sorts of things that need to go right, for my research to actually pay off in an existential safety win?"

And I feel hopeful about people who look at all of this advice, and think "well, this still doesn't actually feel sufficient for me to be that confident my plans are really going to accomplish anything", and set out to brainstorm new ways to shore up their chances.

Comments

It seems important to note that the posited reasoning flaw is much more rational in the context of a puzzle game. In that context, you know that there's a correct answer. If you've exhaustively searched through the possibilities, and you're very confident that there's no solution unless assumption X holds...that's actually a pretty good reason to believe X. 

A syllogism: 

P1: There is a solution.

P2: If not X, there would be no solution.

C: Therefore X.

In this context, subjects seem to be repeatedly miscalibrated about P2. It turns out that they had searched through the space of possibilities much less exhaustively than they thought they had. So they were definitely wrong, overall. 

But this syllogism is still valid. And P1 is a quite reasonable assumption. 

There's an implication that people are applying an identical pattern of reasoning to plans for AI risk. In that context, not only is P2 non-obvious for any given version of P2, but P1 isn't solid either: there's no guarantee that there's a remotely feasible solution in scope, given the available resources. 

In that case, if indeed people are reasoning this way, they're making two mistakes, both P1 and P2, whereas in the puzzle game context, people are only really making one.


From my observations of people thinking about AI risk, I also think that people are (emotionally) assuming that there's a viable solution to AI risk, and their thinking warps accordingly.

But it seems important to note that if we're drawing the conclusion that people make that particular error from this dataset, we're attributing a reasoning error to them in one context, based on observing them use a similar assumption in a different context in which that assumption is, in fact, correct.

Yes, they're correct in assuming success is possible in those situations, but their assumptions about the possible routes to success are highly incorrect. People are making a very large error in overestimating how well they understand the situation and failing to think about other possibilities. This logical error sounds highly relevant to alignment and AI risk.

I agree. 

Yeah. I tried to get at this in the Takeaways but I like your more thorough write up here.

Eliezer Yudkowsky periodically complains about people coming up with questionable plans with questionable assumptions to deal with AI, and then either:

  • Saying "well, if this assumption doesn't hold, we're doomed, so we might as well assume it's true."
  • Worse: coming up with cope-y reasons to assume that the assumption isn't even questionable at all. It's just a pretty reasonable worldview.

Sometimes the questionable plan is "an alignment scheme, which Eliezer thinks avoids the hard part of the problem." Sometimes it's a sketchy reckless plan that's probably going to blow up and make things worse.

Some people complain about Eliezer being a doomy Negative Nancy who's overly pessimistic.

This is somewhat beyond the scope of the primary point I think you are trying to make in this post, but I suspect this recurring dynamic (which I have observed myself a number of times) is more caused by (and logically downstream of) an important object-level disagreement about the correctness of the following claim John Wentworth makes in one of the comments on his post on how "Most People Start With The Same Few Bad Ideas":

I think a lot of people don't realize on a gut level that a solution which isn't robust is guaranteed to fail in practice. There are always unknown unknowns in a new domain; the presence of unknown unknowns may be the single highest-confidence claim we can make about AGI at this point. A strategy which fails the moment any surprise comes along is going to fail; robustness is necessary. Now, robustness is not the same as "guaranteed to work", but the two are easy to confuse. A lot of arguments of the form "ah but your strategy fails in case X" look like they're saying "the strategy is not guaranteed to work", but the actually-important content is "the strategy is not robust to <broad class of failures>"; the key is to think about how broadly the example-failure generalizes. (I think a common mistake newcomers make is to argue "but that particular failure isn't very likely", without thinking about how the failure mode generalizes or what other lack-of-robustness it implies.)

(Bolding is mine.) It seems to me that, at least in a substantial number of such conversations between Eliezer and some other alignment researcher who presents a possible solution to the alignment problem, Eliezer is trying to get this idea across and the other person genuinely disagrees[1]. To give a specific example of this dynamic that seems illustrative, take Quintin Pope's comment on the same post:

The issue is that it's very difficult to reason correctly in the absence of an "Official Experiment"[1]. I think the alignment community is too quick to dismiss potentially useful ideas, and that the reasons for those dismissals are often wrong. E.g., I still don't think anyone's given a clear, mechanistic reason for why rewarding an RL agent for making you smile is bound to fail (as opposed to being a terrible idea that probably fails).

[1] More precisely, it's very difficult to reason correctly even with many "Official Experiments", and nearly impossible to do so without any such experiments.

  1. ^

    As opposed to simply being unaware of this issue and becoming surprised when Eliezer brings it up.

Plans obviously need some robustness to things going wrong, and in a sense I agree with John Wentworth, if weakly, that some robustness is a necessary feature of a plan, and some verification is actually necessary.

But I have to agree that there is a real failure mode identified by moridinamael and Quintin Pope, and that is perfectionism, meaning that you discard ideas too quickly as not useful. This constraint is the essence of that perfectionism:

I have an exercise where I give people the instruction to play a puzzle game ("Baba Is You"), but where, instead of the usual ability to move around and interact with the world to experiment and learn things, you need to make a complete plan for solving the level and aim to get it right on your first try.

It asks both for a complete plan to solve the whole level and for the plan to work on the first try, which outside of this context implies either that the problem is likely unsolvable or that you are being too perfectionist with your demands.

In particular, I think that Quintin Pope's comment here is genuinely something that applies in lots of science and problem solving, and that it's actually quite difficult to reason well about the world in general without many experiments.

I think the concern is that, if plans need some verification, it may be impossible to align smarter-than-human AGI. In order to verify those plans, we'd have to build one. If the plan doesn't work (isn't verified), that may be the end of us - no retries possible.

There are complex arguments on both sides, so I'm not arguing this is strictly true. I just wanted to clarify that that's the concern and the point of asking people to solve it on the first try. I think ultimately this is partly but not 100% true of ASI alignment, and clarifying exactly how and to what degree we can verify plans empirically is critical to the project. Having a plan verified on weak, non-autonomous/self-aware/agentic systems may or may not generalize to that plan working on smarter systems with those properties. Some of the ways verification will or won't generalize can probably be identified with careful analysis of how such systems will be functionally different.

https://arxiv.org/abs/1712.05812

It's directly about inverse reinforcement learning, but that should be strictly stronger than RLHF. Seems incumbent on those who disagree to explain why throwing away information here would be enough of a normative assumption (contrary to every story about wishes.)

dirk:

What I found most interesting was people literally saying the words out loud, multiple times: "Well, if this [assumption] isn't true, then this is impossible" (often explicitly adding "I wouldn't [normally] think this was that likely... but..."). And, then making the mental leap all the way towards "70% that this assumption is true." Low enough for some plausible deniability, high enough to justify giving their plan a reasonable likelihood of success.

It was a much clearer instance of mentally slipping sideways into a more convenient world than I'd have expected to get.

I think this tendency might've been exaggerated by the fact that they were working on a puzzle game; they know the levels are in fact solvable, so if a solution seems impossible it's almost certainly the case that they've misunderstood something.

Yes, your prior on the puzzle being actually unsolvable should be very low, but in almost all such situations, 70% seems way too high a probability to assign to your first guess at what you've misunderstood.

When I prove to myself that a puzzle is impossible (given my beliefs about the rules), that normally leads to a period of desperate scrambling where I try lots of random unlikely crap just-in-case, and it's rare (<10%) that anything I try during that desperate scramble actually works, let alone the first thing.

In the "final" level of Baba Is You, I was stuck for a long time with a precise detailed plan that solved everything in the level except that there was one step in the middle of the form "and then I magically get past this obstacle, somehow, even though it looks impossible."  I spent hours trying to solve that one obstacle.  When I eventually beat the level, of course, it was not by solving that obstacle--it was by switching to a radically different approach that solved several other key problems in entirely different ways.  In hindsight, I feel like I should have abandoned that earlier plan much sooner than I did.

In mitigation:  I feel that solutions in Baba Is You are significantly harder to intuit than in most puzzle games.

Yes, but even so they failed much more often than the 30% their estimates implied. So the tendency you are pointing out wasn't helping much. I do agree that it would be a better test to make multiple levels using the level editor, and have some of them literally be unsolvable. Then have someone have the choice of writing out a solution or writing out a proof for why there could be no solution. I think you'd find an even lower success rate in this harder version.

Jay:

FYI, the green energy field is much, much worse in this regard than the AI field.

Curious for details.

Jay:

Almost everyone believes one and only one of the following statements:

  1. We have, or can soon develop, the technology to power society with clean energy at a modern standard of living, or
  2. Global warming is not a major problem (e.g. it won't damage agricultural productivity to famine-inducing levels).

Logically, these statements have nothing to do with each other.  Either, neither, or both could be correct (I suspect neither).  They are, in the words of the post, questionable assumptions.

See also Beyond the Peak – Ecosophia and Germany's economy struggles with an energy shock that's exposing longtime flaws | AP News.


In your Baba-Is-You methodology, were test takers able to play through tutorial levels normally before they switched to plan-in-advance? Did you ask them if they had any familiarity with the game beforehand?

People varied in how much Baba-Is-You experience they had. Some of them were completely new, and did complete the first couple levels (which are pretty tutorial-like) using the same methodology I outline here, before getting to a level that was a notable challenge.

They actually did complete the first couple levels successfully, which I forgot when writing this post. This does weaken the rhetorical force, but also, the first couple levels are designed more to teach the mechanics and are significantly easier. I'll update the post to clarify this.

Some of them had played before, and were starting a new level from around where they left off.

These are interesting anecdotes but it feels like they could just as easily be used to argue for the opposite conclusion.

That is, your frame here is something like "planning is hard therefore you should distrust alignment plans".

But you could just as easily frame this as "abstract reasoning about unfamiliar domains is hard therefore you should distrust doom arguments".

Also, the second section makes an argument in favor of backchaining. But that seems to contradict the first section, in which people tried to backchain and it went badly. The best way for them to make progress would have been to play around with a bunch of possibilities, which is closer to forward-chaining.

(And then you might say: well, in this context, they only got one shot at the problem. To which I'd say: okay, so the most important intervention is to try to have more shots on the problem. Which isn't clearly either forward-chaining or back-chaining, but probably closer to the former.)

These are interesting anecdotes but it feels like they could just as easily be used to argue for the opposite conclusion.

That is, your frame here is something like "planning is hard therefore you should distrust alignment plans".

But you could just as easily frame this as "abstract reasoning about unfamiliar domains is hard therefore you should distrust doom arguments".

That doesn't sound right to me. 

The reported observation is not just that these particular people failed at a planning / reasoning task. The reported observation is that they repeatedly made optimistic miscalibrated assumptions, because those assumptions supported a plan.

There's a more specific reasoning error that's being posited, beyond "people are often wrong when trying to reason about abstract domains without feedback". Something like "people will anchor on ideas, if those ideas are necessary for the success of a plan, and they don't see an alternative plan."

If that posit is correct, that's not just an update of "reasoning abstractly is hard and we should widen our confidence intervals / be more uncertain". We should update to having a much higher evidential bar for the efficacy of plans.

Raemon:

I had a second half of this essay that felt like it was taking too long to pull together and I wasn't quite sure who I was arguing with. I decided I'd probably try to make it a second post. I generally agree it's not that obvious what lessons to take.

The beginning of the second-half/next-post was something like:

There's an age-old debate about AI existential safety, which I might summarize as the viewpoints:

1. "We only get one critical try, and most alignment research dodges the hard part of the problem, with wildly optimistic assumptions."

vs

2. "It is basically impossible to make progress on remote, complex problems on your first try. So, we need to somehow factor the problem into something we can make empirical progress on."

I started out mostly thinking through lens #1. I've updated that, actually, both views may be "hair on fire" levels of important. I have some frustrations both with doomer-y people who seem resistant to incorporating lens #2, and with people who seem to (in practice) be satisfied with "well, iterative empiricism seems tractable, and we don't super need to incorporate frame #1."

I am interested in both:

  • trying to build "engineering feedback loops" that more accurately represent the final problem as best we can, and then iterating on both "solving representative problems against our current best engineered benchmarks" while also "continuing to build better benchmarks. (Automating Auditing and Model Organisms of Misalignment seem like attempts at this)
  • trying to develop training regimens that seem like they should help people plan better in Low-Feedback Domains, which includes theoretical work, empirical research that's trying to keep its eye on the longterm ball better, and the invention of benchmarks a la the previous bullet.

That is, your frame here is something like "planning is hard therefore you should distrust alignment plans".

But you could just as easily frame this as "abstract reasoning about unfamiliar domains is hard therefore you should distrust doom arguments".

I think received wisdom in cryptography is "don't roll your own crypto system".  I think this comes from a bunch of overconfident people doing this and then other people discovering major flaws in what they did, repeatedly.

The lesson is not "Reasoning about a crypto system you haven't built yet is hard, and therefore it's equally reasonable to say 'a new system will work well' and 'a new system will work badly'."  Instead, it's "Your new system will probably work badly."

I think the underlying model is that there are lots of different ways for your new crypto system to be flawed, and you have to get all of them right, or else the optimizing intelligence of your rivals (ideally) or the malware industry (less ideally) will find the security hole and exploit it.  If there are ten things you need to get right, and you have a 30% chance of screwing up each one, then the chance of complete success is 2.8%.  Therefore, one could say that, if there's a general fog of "It's hard to reason about these things before building them, such that it's hard to say in advance that the chance of failure in each thing is below 30%", then that points asymmetrically towards overall failure.

I think Raemon's model (pretty certainly Eliezer's) is indeed that an alignment plan is, in large parts, like a security system in that there are lots of potential weaknesses any one of which could torpedo the whole system, and those weaknesses will be sought out by an optimizing intelligence.  Perhaps your model is different?

Also, the second section makes an argument in favor of backchaining. But that seems to contradict the first section, in which people tried to backchain and it went badly.

This didn't come across in the post, but – I think people in the experiment were mostly doing things closer to (simulated) forward chaining, and then getting stuck, and then generating the questionable assumptions. (which is also what I tended to do when I first started this experiment). 

An interesting thing I learned is that "look at the board and think without fiddling around" is actually a useful skill to have even when I'm doing the more open-ended "solve it however seems best." It's easier to notice now when I'm fiddling around pointlessly instead of actually doing useful cognitive work.

But you could just as easily frame this as "abstract reasoning about unfamiliar domains is hard therefore you should distrust doom arguments".

But doesn't this argument hold with the opposite conclusion, too? E.g. "abstract reasoning about unfamiliar domains is hard therefore you should distrust arguments about good-by-default worlds". 

Yepp, all of these arguments can weigh in many different directions, depending on your background beliefs. That's my point.

No, failure in the face of strong active hostile optimization pressure is different from failure when acting to overcome challenges in a neutral universe that just obeys physical laws. Building prisons is a different challenge from building bridges. Especially in the case where you know that there will be agents both inside and outside the prison that very much want to get through. Security is inherently a harder problem. A puzzle which has been designed to be challenging to solve and to have a surprising unintuitive solution is thus a better match for designing a security system. I bet you'd have a much higher success rate of 'first plans' if the game were one of those bridge-building games.

Yes to your first point. I think that

abstract reasoning about unfamiliar domains is hard therefore you should distrust doom arguments

Is a fair characterization of those results. So would be the inverse, "abstract reasoning about unfamiliar domains is hard therefore you should distrust AI success arguments".

I think both are very true, and so we should distrust both. We simply don't know. 

I think the conclusion taken,

planning is hard therefore you should distrust alignment plans

Is also valid and true. 

People just aren't as smart as we'd like to think we are, particularly in reasoning about complex and unfamiliar domains. So both our plans and evaluations of them tend to be more untrustworthy than we'd like to think. Planning and reasoning require way more collective effort than we'd like to imagine. Careful studies of both individual reasoning in lab tasks and historical examples support this conclusion.

One major reason for this miscalibration is the motivated reasoning effect. We tend to believe what feels good (predicts local reward). Overestimating our reasoning abilities is one such belief among very many examples of motivated reasoning.

I don't think it's unreasonable to distrust doom arguments for exactly this reason?

Yes, I'm saying it's a reasonable conclusion to draw, and the fact that it isn't drawn here is indicative of a kind of confirmation bias.

So, I definitely do think I've got some confirmation bias here – I know because the first thing I thought when I saw it was "man this sure looks like the thing Eliezer was complaining about", and it was a while later, thinking it through, that I was like "this does seem like it should make you really doomy about any agent-foundations-y plans, or other attempts to sidestep modern ML and cut towards 'getting the hard problem right on the first try.'"

I did (later) think about that a bunch and integrate it into the post.

I don't know whether I think it's reasonable to say "it's additionally confirmation-bias-indicative that the post doesn't talk about general doom arguments." As Eli says, the post is mostly observing a phenomenon that seems more about planmaking than general reasoning.

(fwiw my own p(doom) is more like 'I dunno man, somewhere between 10% and 90%, and I'd need to see a lot of things going concretely right before my emotional center of mass shifted below 50%')

Thanks for clarifying. I do agree with the broader point that one should have a sort of radical uncertainty about (e.g.) a post AGI world. I'm not sure I agree it's a big issue to leave that out of any given discussion though, since it shifts probability mass from any particular describable outcome to the big "anything can happen" area. (This might be what people mean by "Knightian uncertainty"?)

What I take away from this is that they should have separated the utility of an assumption being true from the probability/likelihood of it being true, and indeed this shows some calibration problems.

There is slipping into more convenient worlds for reasons based on utility rather than evidence, which is a problem (assuming it's solvable for you.)

This is an important takeaway, but I don't think your other takeaways help as much as this one.

That said, this constraint IRL makes almost all real-life problems impossible for humans and AIs:

I have an exercise where I give people the instruction to play a puzzle game ("Baba Is You"), but where, instead of the usual ability to move around and interact with the world to experiment and learn things, you need to make a complete plan for solving the level and aim to get it right on your first try.

In particular, if such a constraint exists, then it's a big red flag that the problem you are solving is impossible to solve, given that constraint.

Almost all plans fail on the first try, even really competent plans made by really competent humans, and outside of very constrained regimes, zero plans work out on the first try.

Thus, if you are truly in a situation where you are encountering such constraints, you should give up on the problem ASAP, and rest a little to make sure that the constraint actually exists.

So while this is a fun experiment, with real takeaways, I'd warn people that constraining a plan to work on the first try and requiring completeness makes lots of problems impossible to solve for us humans and AIs.

One of Eliezer's essays in The Sequences is called Shut Up and Do the Impossible.

I'm confident Eliezer would agree with you that if you can find a way to do something easier instead, you should absolutely do that.  But he also argues that there is no guarantee that something easier exists; the universe isn't constrained to only placing fair demands on you.

My point isn't that the easier option always exists, or even that a problem can't be impossible.

My point is that if you are facing a problem that requires 1-shot complete plans, and there's no second try, you need to do something else.

There is a line where a problem becomes too difficult to productively work on, and that constraint is a great sign of an impossible problem (if it exists.)

The maximum difficulty that is worth attempting depends on the stakes.

AND your accurate assessment of the difficulty. The overconfidence displayed in this mini-experiment seems to result in part from people massively misestimating the difficulty of this relatively simple problem. That's why it's so concerning WRT alignment.

Do things like major surgery or bomb defusal have those kinds of constraints?

Not really, but they are definitely more few-shot than other areas; thankfully, getting one thing wrong isn't usually an immediate game-ender (though it is still to be avoided, and importantly this is why these two areas are harder than a lot of other fields).

Ah- well said. I understand the rest of your comments better now. And I thoroughly agree, with a caveat about the complexity of the problem and the amount of thought and teamwork applied (e.g., I expect that a large team working for a month in effective collaboration would've solved the problem in this experiment, but alignment is probably much more difficult than that).

pom:

I am slightly intrigued by this game, as it seems to look like how I approach things in general when I have enough time and information to do so. Just an interesting aside, I like that people are also doing these kinds of things after all, or at least attempting to. As I am by no means "successful" with my method, it does seem to be the only way I can get myself to undertake something when it is important to me that I could at least have a somewhat realistic chance at succeeding.

I'll get 10 extra units of production, or damage. But, then I reach the next stage, and it turns out I really needed 100 extra units to survive.

I'm having a hard time thinking of any of my personal experiences that match this pattern, and would be interested to hear a couple examples.

(Though I can think of several experiences along the lines of "I got 10 points of armor, and then it turned out the next stage had a bunch of attacks that ignore armor, so armor fundamentally stops working as a strategy."  There are a lot of games where you just don't have the necessary information on your first try.)

Games I was particularly thinking of were They Are Billions, Slay The Spire. I guess also Factorio although the shape of that is a bit different.

(to be clear, these are fictional examples that don't necessarily generalize, but, when I look at the AI situation I think it-in-particular has an 'exponential difficulty' shape)

I would add Balatro (especially endless mode) to the list

I haven't played They Are Billions.  I didn't have that experience in Slay the Spire, but I'd played similar games before.  I suppose my first roguelike deckbuilder did have some important combo-y stuff that I didn't figure out right away, although that game was basically unwinnable on your first try for a bunch of different reasons.

"Slipping into a more convenient world" is a good way of putting it; just using the word "optimism" really doesn't account for how it's pretty slippy, nor how the direction is towards a more convenient world.


I love this post.

I think you forgot to mention an important prerequisite. It was reasonable to assume the prereq but still worth mentioning I think. You should be looking for the W, for the clear win. It's easy to just fart around and forget you were trying to make something happen. And in real life there are often bigger and clearer wins available than is immediately apparent. Often this takes much time and energy and creativity to see. Often the more important/urgent problem can be easier to solve. People tend to love tricky things and puzzles. It can be hard to learn to love easy victory. Pop the soccer-ball! Stab your opponent in the back while they sleep on Christmas night! Replace your sails with diesel motors! Solve your integral numerically! Use steel instead of wood! — This lesson is of course well known here, but it's hard to have a consuming conversation or intriguing post about it. I often forget this.

Sometimes, you might as well solve the Rubik's Cube by peeling the stickers off and sticking them back on.

TvTropes calls this Cutting the Knot, after the story of Alexander and the Gordian Knot.