Reviews 2022

Sorted by New

The consensus goals strongly need rethinking, imo. This is a clear and fairly simple start at such an effort. Challenging the basics matters. 

this post made me understand something i did not understand before that seems very important. important enough that it made me reconsider a bunch of related beliefs about ai.

+9. This is a powerful set of arguments pointing out how humanity will literally go extinct soon due to AI development (or have something similarly bad happen to us). A lot of thought and research went into building an understanding of the problem deep enough to produce this piece, and I'm extremely glad it was written up.

TurnTrout is obviously correct that "robust grading is... extremely hard and unnatural" and that loss functions "chisel circuits into networks" and don't directly determine the target of the product AI. Where he loses me is the part where he suggests that this makes alignment easier and not harder. I think that all this just means we have even less control over the policy of the resulting AI, the default end case being some bizarre construction in policyspace with values very hard to determine based on the recipe. I don't understand what point he's making in the above post that contradicts this.

Seems to me like a blindingly obvious post that was kind of outside of the Overton window for too long. Eliezer also smashed the window with his TIME article, but this was first, so I think it's still a pretty great post. +4

This is an example of a clear textual writeup of a principle of integrity. I think it's a pretty good principle, and one that I refer to a lot in my own thinking about integrity.

But even if I thought it was importantly flawed, I think integrity is super important, and therefore I really want to reward and support people thinking explicitly about it. That allows us to notice that our notions are flawed, and improve them, and it also allows us to declare to each other what norms we hold ourselves to, instead of sort of typical minding and assuming that our notion of integrity matches others' notion, and then being shocked when they behave badly on our terms.

I think about this framing quite a lot: is what I say going to lead to people assuming roughly the thing I think, even if I'm not precise? So the concept is pretty valuable to me. 

I don't know if it was the post that did it, but maybe!

I've used this a bit, but not loads. I prefer Fatebook, Metaculus, Manifold, and betting. I don't quite know why I don't use it more; here are some guesses.

  • I found the tool kind of hard to use
  • It was hard to search for the kind of information that I use to forecast
  • Often I would generate priors based on my current state, but those were wrong in strange ways (I knew something happened but after the deadline)
  • It wasn't clear that it was helping me to get better versus doing lots of forecasting on other platforms.

I view this post as providing value in three (related) ways:

  1. Making a pedagogical advancement regarding the so-called inner alignment problem
  2. Pointing out that a common view of "RL agents optimize reward" is subtly wrong
  3. Pushing for thinking mechanistically about cognition-updates

 

Re 1: I first heard about the inner alignment problem through Risks From Learned Optimization and popularizations of the work. I didn't truly comprehend it - sure, I could parrot back terms like "base optimizer" and "mesa-optimizer", but it didn't click. I was confused.

Some mon... (read more)

Decision theory is hard. In trying to figure out why DT is useful (needed?) for AI alignment in the first place, I keep running into weirdness, including with bargaining.

Without getting too in-the-weeds: I'm pretty damn glad that some people out there are working on DT and bargaining.

I feel like Project Lawful, as well as many of Lintamande's other glowfic since then, have given me a whole lot deeper an understanding of... a collection of virtues including honor, honesty, trustworthiness, etc, which I now mostly think of collectively as "Law".

I think this has been pretty valuable for me on an intellectual level—I think, if you show me some sort of deontological rule, I'm going to give a better account of why/whether it's a good idea to follow it than I would have before I read any glowfic.

It's difficult for me to separate how much of t... (read more)

Still seems too early to tell if this is right, but man is it a crux (explicit or implicit).

 

Terence Tao seems to have gotten some use out of the most recent LLMs.

Reading Project Lawful (so far, which is the majority of Book 1) has given me a strong mental pointer to the question of "how to model a civilization that you find yourself in" and "what questions to ask when trying to improve it and fix it", from a baseline of not really having a pointer to this at all (I have only lived in one civilization and I've not been dropped into a new one before). I would do many things differently to Keltham (I suspect I'd build prediction markets before trying to scale up building roads) but it's nonetheless extremely valuable ... (read more)

Holden's posts on writing and research (this one, The Wicked Problem Experience and Useful Vices for Wicked Problems) have been the most useful posts for me from Cold Takes and been directly applicable to things I've worked on. For instance the Wicked Problem Experience was published soon after I wrote a post about the discovery of laws of nature, and resonated with and validated some of how I approached and worked on that. I give all three +4.

I think this post is a short yet concrete description of what I and others find controlling and mind-warping about the present-day schooling system. It's one of the best short-posts that I can link to on this subject. +4

Someone working full-time on an approach to the alignment problem that they feel optimistic about, and writing annual reflections on their work, is something that has been sorely lacking. +4

This is a better spirit with which to accomplish great and important tasks than most I have around me, and I'm grateful that it was written up. I give this +4.

As with the CCS post, I'm reviewing both the paper and the post, though the majority of the review is on the paper. Writing this quickly (total time on review: ~1.5h), but I expect to be willing to defend the points being made --

There are a lot of reasons I like the work. It's an example of:

  1. Actually poking inside a real model. A lot of the mech interp work in early-mid 2022 was focused on getting a deep understanding of toy models trained on algorithmic tasks (at least in this community).[1] There was some effort at Redwood to do neuron-by-neuron replac
... (read more)

This post seems like it was quite influential. This is basically a trivial review to allow the post to be voted on.

I think of this as a fairly central post in the unofficial series on How to specialize in Problems We Don't Understand (which, in turn, is the post that most sums up what I think the art of rationality is for. Or at least the parts I'm most excited about).

I think this concept is important. It feels sort of... incomplete. Like, it seems like there are some major follow-up threads, which are:

  • How to teach others what useful skills you have.
  • How to notice when an expert has a skill, and 
    • how to ask them questions that help them tease out the details.

This feels like a helpful concept to re-familiarize myself with as I explore the art of deliberate practice, since "actually get expert advice on what/how to practice" is one of the most centrally recommended facets.

IMO, this post makes several locally correct points, but overall fails to defeat the argument that misaligned AIs are somewhat likely to spend (at least) a tiny fraction of resources (e.g., between 1/million and 1/trillion) to satisfy the preferences of currently existing humans.

AFAICT, this is the main argument it was trying to argue against, though it shifts to arguing about half of the universe (an obviously vastly bigger share) halfway through the piece.[1]

When it returns to arguing about the actual main question (a tiny fraction of resources) at the e... (read more)

In his dialogue Deconfusing Some Core X-risk Problems, Max H writes:

Yeah, coordination failures rule everything around me. =/

I don't have good ideas here, but something that results in increasing the average Lawfulness among humans seems like a good start. Maybe step 0 of this is writing some kind of Law textbook or Sequences 2.0 or CFAR 2.0 curriculum, so people can pick up the concepts explicitly from more than just, like, reading glowfic and absorbing it by osmosis. (In planecrash terms, Coordination is a fragment of Law that follows from Validity, Util

... (read more)

Safetywashing describes a phenomenon that is real, inevitable, and profoundly unsurprising (I am still surprised whenever I see it, but that's my fault for knowing something is probable and being surprised anyway). Things like this are fundamental to human systems; people who read the Sequences know this.

This post doesn't prepare people, at all, for the complexity of how this would play out in reality. It's possible that most posts would fail to prepare people, because these posts change goalposts; and in the mundane process of following their incentives, ... (read more)

I really enjoyed this sequence, it provides useful guidance on how to combine different sources of knowledge and intuitions to reason about future AI systems. Great resource on how to think about alignment for an ML audience. 

This is a short self-review, but with a bit of distance, I think understanding 'limits to legibility' is one of the maybe top 5 things an aspiring rationalist should deeply understand, and that lacking it leads to many bad outcomes in both rationalist and EA communities.

In a very brief form: maybe the most common cause of EA problems and stupidities is attempts to replace illegible S1 boxes able to represent human values, such as 'caring', with legible, symbolically described, verbal moral reasoning subject to memetic pressure.

Maybe the most common cause of rationalist problems and difficulties with coordination is cases where people replace illegible smart S1 computations with legible S2 arguments.

In my personal view, 'Shard theory of human values' illustrates both the upsides and pathologies of the local epistemic community.

The upsides
- the majority of the claims are true, or at least approximately true
- "shard theory" as a social phenomenon reached critical mass, making the ideas visible to the broader alignment community, which works e.g. through talking about them in person, votes on LW, series of posts, ...
- shard theory coined a number of locally memetically fit names or phrases, such as 'shards'
- part of the success leads at some people in the AGI labs to... (read more)

These kinds of overview posts are very valuable, and I think this one is as well. I think it was quite well executed, and I've seen it linked a lot, especially to newer people trying to orient to the state of the AI Alignment field, and the ever growing number of people working in it. 

This is IMO actually a really important topic, and this is one of the best posts on it. I think it probably really matters whether the AIs will try to trade with us or care about our values even if we had little chance of making our actions with regards to them conditional on whether they do. I found the arguments in this post convincing, and have linked many people to it since it came out. 

I am not a huge fan of shard theory, but other people seem into it a bunch. This post captured at least a bunch of my problems with shard theory (though not all of them, and it's not a perfect post). This means the post at least has saved me some writing effort a bunch of times. 

Nuclear famine (and relatedly nuclear winter) is one of those things that comes up all the time in discussion about existential and catastrophic risk, and this post (together with the previous one) continues to be one of the things I reference most frequently when that topic comes up.

This is a review of both the paper and the post itself, and turned more into a review of the paper (on which I think I have more to say) as opposed to the post. 

Disclaimer: this isn’t actually my area of expertise inside of technical alignment, and I’ve done very little linear probing myself. I’m relying primarily on my understanding of others’ results, so there’s some chance I’ve misunderstood something. Total amount of work on this review: ~8 hours, though about 4 of those were refreshing my memory of prior work and rereading the paper. 

TL... (read more)

I really liked this post. It's not world-shattering, but it was a nice clear dive into a specific topic that I like learning about. I would be glad about a LessWrong with more posts like this.

I cannot claim to have used this tool very much since it was announced, but this sure seems like the way to go if one wanted to quickly get better at any kind of medium or long-term forecasting. I would also really love a review from someone who has actually used it a bunch, or a self-review by the people who built it. 

This post makes a pretty straightforward and important point, and I've referenced it a few times since then. It hasn't made a huge impact, and it isn't the best explanation, but I think it's a good one that covers the basics, and I think it could be linked to more frequently.

I think Redwood's classifier project was a reasonable project to work towards, and I think this post was great because it both displayed a bunch of important virtues and avoided doubling down on trying to always frame one's research in a positive light. 

I was really very glad to see this update come out at the time, and it made me hopeful that we can have a great discourse on LessWrong and AI Alignment where when people sometimes overstate things, they can say "oops", learn and move on. My sense is Redwood made a pretty deep update from the first post they published (and this update), and hasn't made any similar errors since then.

I am not that excited about marginal interpretability research, but I have nevertheless linked to this a few times. I think this post both clarifies a bunch of inroads into making marginal interpretability progress, but also maps out how long the journey between where we are and where many important targets are for using interpretability methods to reduce AI x-risk.

Separately, besides my personal sense that marginal interpretability research is not a great use of most researcher's time, there are really a lot of people trying to get started doing work on A... (read more)

I disagree with the conclusion of this post, but still found it a valuable reference for a bunch of arguments I do think are important to model in the space.

I thought this post and associated paper was worse than Richard's previous sequence "AGI safety from first principles", but despite that, I still think it's one of the best pieces of introductory content for AI X-risk. I've also updated that good communication around AI X-risk stuff will probably involve writing many specialized introductions that work within the epistemic frames and methodologies of many different communities, and I think this post does reasonably well at that for the ML community (though I am not a great judge of that).

I had read this post at the time, but forgotten about it since then. I've also made the central point of this post many times, and wish I had linked to it more, since it's a pretty good explanation.

I summarized my thoughts on this sequence in my other review on the next post in this sequence. Most of my thoughts there also apply to this post.

I am surprised that a post with nearly 650 karma doesn't have a review yet. It seems like it should have at least one so it can go through to the voting phase.

I am confused about whether the videos are real and exactly how much faster AIs could be run. But I think at the very least it's a promising direction to look for grokkable bounds on how advanced AI will go.

I feel kind of conflicted about this post overall, but I certainly find myself using the concept fairly frequently.

This was some pretty engaging sci-fi, and I'm glad it's on LessWrong.

One of the best parts of the CFAR handbook, and most of the sections are solid as standalone pieces. A compressed instruction manual to the human brain. The analogies and examples are incredibly effective at covering the approach from different angles, transmitting information about "how to win" into your brain with high fidelity, with what I predict will be a low failure rate for a wide variety of people.

  1. Be Present is a scathing criticism of how modern society keeps people sort of half-alive. It's my favorite and what I found to be the most important meta-skill;
... (read more)

Self Review. 

I wasn't sure at the time whether the effort I put into this post would be worth it. I spent around 8 hours, I think, and I didn't end up with a clear gearsy model of how High Reliability tends to work.

I did end up following up on this, in "Carefully Bootstrapped Alignment" is organizationally hard. Most of how this post applied there was me including the graph from the vague "hospital Reliability-ification process" paper, in which I argued:

The report is from Genesis Health System, a healthcare service provider in Iowa that services 5 hospitals. N

... (read more)

I think this post was quite helpful. I think it does a good job laying out a fairly complete picture of a pretty reasonable safety plan, and the main sources of difficulty. I basically agree with most of the points. Along the way, it makes various helpful points, for example introducing the "action risk vs inaction risk" frame, which I use constantly. This post is probably one of the first ten posts I'd send someone on the topic of "the current state of AI safety technology".

I think that I somewhat prefer the version of these arguments that I give in e.g. ... (read more)

I've had a vague intent to deliberately apply this technique since first reading this two years ago. I haven't actually done so, alas.

It still looks pretty good to me on paper, and feels like something I should attempt at some point. 

Lightcone has evolved a bit since Jacob wrote this, and also I have a somewhat different experience from Jacob. 

Updates:

  • "Meeting day" is really important to prevent people being blocked by meetings all week, but, it's better to do it on Thursday than Tuesday (Tuesday Meeting Days basically kill all the momentum you built up on Monday)
  • We hit the upper limits of how many 1-1 public DM channels really made sense (because it grew superlinearly with the number of employees). We mostly now have "wall channels" (i.e. raemon-wall), where people who want to me
... (read more)

My previous review of this is in this older comment. Recap:

  • I tried teaching a variation of this exercise that was focused on observing your internal state (sort of as an alternative to "Focusing"). I forgot to include the "meta strategy" step, which upon reflection was super important (so important I independently derived it for another portion of the same workshop I was running at the time). The way I taught this exercise it fell flat, but I think the problem was probably in my presentation.
  • I did have people do the "practice observing things" (physical things, no
... (read more)

I should acknowledge first that I understand that writing is hard. If the only realistic choice was between this post as it is, and no post at all, then I'm glad we got the post rather than no post.

That said, by the standards I hold my own writing to, I would be embarrassed to publish a post like this, which criticizes imaginary paraphrases of researchers rather than citing and quoting the actual text they've actually published. (The post acknowledges this as a flaw, but if it were me, I wouldn't even publish.) The reason I don't think critics necessarily nee... (read more)

This was one of those posts that I dearly wish somebody else besides me had written, but nobody did, so here we are. I have no particular expertise. (But then again, to some extent, maybe nobody does?)

I basically stand by everything I wrote here. I remain pessimistic for reasons spelled out in this post, but I also still have a niggling concern that I haven’t thought these things through carefully enough, and I often refer to this kind of stuff as “an area where reasonable people can disagree”.

If I were rewriting this post today, three changes I’d make wou... (read more)

This post is not only groundbreaking research into the nature of LLMs but also a perfect meme. Janus's ideas are now widely cited at AI conferences and in papers around the world. While the assumptions may be correct or incorrect, the Simulators theory has sparked huge interest among a broad audience, including not only AI researchers. Let's also appreciate the fact that this post was written based on the author's interactions with the non-RLHFed GPT-3 model, well before the release of ChatGPT or Bing, and it has accurately predicted some quirks in their behavi... (read more)

The thing I want most from LessWrong and the Rationality Community writ large is the martial art of rationality. That was the Sequences post that hooked me, that is the thing I personally want to find if it exists, that is what I thought CFAR as an organization was pointed at.

When you are attempting something that many people have tried before- and to be clear, "come up with teachings to make people better" is something that many, many people have tried before- it may be useful to look and see what went wrong last time.

In the words of Scott Alexander, "I’m... (read more)

Figuring out the edge cases about honesty and truth seem important to me, both as a matter of personal aesthetics and as a matter for LessWrong to pay attention to. One of the things people have used to describe what makes LessWrong special is that it's a community focused on truth-seeking, which makes "what is truth anyway and how do we talk about it" a worthwhile topic of conversation. This article talks about it, in a way that's clear. (The positive example negative example pattern is a good approach to a topic that can really suffer from illusion of tr... (read more)

I like this article for having amusing stories told with cute little diagrams that manage to explain a specific mental technique. At the end of reading it, I sat down, drew some trees, nodded, and felt like I'd learned a new tool. It's not a tool I use explicitly very often, but I use it a little, using it more wouldn't hurt, and if it happens to be a tool you'd use a lot (maybe because it covers a gap or mistake you make more) then this article is a really good explanation of how to use it and why.

It's interesting to compare this to Goal Factoring. They... (read more)

This was counter to the prevailing narrative at the time, and I think did some of the work of changing the narrative. It's of historical significance, if nothing else.

I'm reaffirming my relatively extensive review of this post.

The simbox idea seems like a valuable guide for safely testing AIs, even if the rest of the post turns out to be wrong.

Here's my too-terse summary of the post's most important (and more controversial) proposal: have the AI grow up in an artificial society, learning self-empowerment and learning to model other agents. Use something like retargeting the search to convert the AI's goals from self-empowerment to empowering other agents.

I'm reaffirming my relatively long review of Drexler's full QNR paper.

Drexler's QNR proposal seems like it would, if implemented, guide AI toward more comprehensible systems. It might modestly speed up capabilities advances, while being somewhat more effective at making alignment easier.

Alas, the full paper is long, and not an easy read. I don't think I've managed to summarize its strengths well enough to persuade many people to read it.

I've used the term "safetywashing" at least once every week or two in the last year. I don't know whether I've picked it up from this post, but it still seems good to have an explanation of a term that is this useful and this common that people are exposed to.

I currently think that the case study of computer security is among one of the best places to learn about the challenges that AI control and AI Alignment projects will face. Despite that, I haven't seen that much writing trying to bridge the gap between computer security and AI safety. This post is one of the few that does, and I think does so reasonably well.

For the past few years I've read Logan-stuff, and felt a vague sense of impatience about it, and a vague sense of "if I were more patient, maybe a good thing would happen though?". This year I started putting more explicit effort into cultivating patience.

I've read this post thrice now, and each time I start out thinking "yeah, patience seems like a thing I could have more of"... and then I get to the breakdown of "tenacity, openness and thoroughness" and go "oh shit I forgot about that breakdown. What a useful breakdown." It feels helpful because it sugg... (read more)

I've long believed TAPs are a fundamental skill-building block. But I've noticed lately that I never really gained, or solidified as well as I'd like, the skill of building TAPs. 

I just reread this post to see if it'd help. One paragraph that stands out to me is this:

And in cases where this is not enough—where your trigger does indeed fire, but after two weeks of giving yourself the chance to take the stairs, you discover that you have actually taken yourself up on it zero times—the solution is not TAPs!  The problem lies elsewhere—it's not an is

... (read more)

Solid, aside from the faux pas self-references. If anyone wonders why people would have a high p(doom), especially Yudkowsky himself, this doc solves the problem in a single place. Demonstrates why AI safety is superior to most other elite groups; we don't just say why we think something, we make it easy to find as well. There still isn't much need for Yudkowsky to clarify further, even now.

I'd like to note that my professional background makes me much better at evaluating Section C than Sections A and B. Section C is highly quotable, well worth multiple ... (read more)

I tested it on a bunch of different things, important and unimportant. It did exactly what it said it would: substantially better predictions, immediate results, and little or no effort (aside from the initial cringe away from using this technique at all). It just works.

Looking at it like training/practicing a martial art is helpful as well.

This is huge if it works; you're basically able to reprogram your entire life. 

I haven't gotten it to work yet; this post makes it look like it's not that hard to set up, but in my case at least, it will probably require multiple days to set the trigger. It's probably worth many hours of effort to set a trigger for "stop and think about what I should be thinking about", but I've heard from at least one other person who had a hard time setting the trigger such that it would go off. 

I, at least, was led to think by this post that it would take less effort than actuall... (read more)

Sharp Left Turn: a more important problem (and a more specific threat model) than people usually think

The sharp left turn is not a simple observation that we've seen capabilities generalise more than alignment. As I understand it, it is a more mechanistic understanding that some people at MIRI have, of dynamics that might produce systems with generalised capabilities but not alignment.

Many times over the past year, I've been surprised by people in the field who've read Nate's post but somehow completely missed the part where it talks about specific dynamic... (read more)

I read this a year or two ago, tucked it in the back of my mind, and continued with life.

When I reread it today, I suddenly realized oh duh, I’ve been banging my head against this on X for months. I’d noticed there was this interpersonal dynamic that kept trying to blow up, where I kept not seeing the significance of phrasings or word choices other people said they found deeply important. I’d been using Typical Mind Fallacy and trying to figure out how to see through their eyes, and it kept not working.

Colour Blindness feels like a close cousin of Typical ... (read more)

I think that (1) this is a good deconfusion post, (2) it was an important post for me to read, and definitely made me conclude that I had been confused in the past, (3) and one of the kinds of posts that, ideally, in some hypothetical and probably-impossible past world, would have resulted in much more discussion and worked-out-cruxes in order to forestall the degeneration of AI risk arguments into mutually incomprehensible camps with differing premises, which at this point is starting to look like a done deal?

On the object level: I currently think that --... (read more)

I found this post to be a clear and reasonable-sounding articulation of one of the main arguments for there being catastrophic risk from AI development. It helped me with my own thinking to an extent. I think it has a lot of shareability value.

I think this post has the highest number of people who have reached out privately to me, thanking me for it. (I think 4-5 people). So seems like it's been at least pretty useful to some people.

I previously wrote an addendum a few months after publishing, mostly saying that while I stood by the technical words of the post, I felt like the narrative vibe of the post somewhat oversold my particular approach. Major grieving still seems to take a long time.

One update is that, the following December after writing this, I did a second "Private Dark Solstice" ritu... (read more)

This post is really important, as a lot of other material on LessWrong (notably AI to Zombies) really berates the idea of trying out things that haven't been tested via the scientific method. 
This post explains that some (especially health) conditions may fall completely outside the scope of what is testable via the scientific method, and at some point turning to chance is a good idea, reminding us that intuition may often be wrong but can work wonders when used as a last resort. 
This is something to remember when trying to solve problems that don't seem to have one perfect mathematical solution (yet).

Overall I'm delighted with this post. It gave me a quick encapsulation of an idea I now refer to a lot, and I've received many reports of it inspiring other people to run helpful tests.

A number of my specifics were wrong; it now looks like potatoes were irrelevant or at least insufficient for weight loss, and I missed the miracle of watermelon.  I think this strengthens rather than weakens the core philosophical point of the post, although of course I'd rather have been right all along. 

This post didn't lead to me discovering any new devices, and I haven't heard from anyone who found something they valued via it. So overall not a success, but it was easy to write so I don't regret the attempt. 

This experiment didn't really work out, but it's the kind of thing I expect to produce really great results every once in a while. 

Shoulder!Justis telling me to replace "it", "this", etc with real nouns is maybe the most legible improvement in my writing over the last few years.

I have three views on this post.

One view: The first section (say, from "While working as the curriculum director" to "Do you know what you are doing, and why you are doing it?") I want as its own post. The Fundamental Question is too short. (https://www.lesswrong.com/posts/xWozAiMgx6fBZwcjo/the-fundamental-question) I think this is a useful question to have loaded in a person's brain, and the first section of this post explains how to use it and makes a pitch for why it's important. I haven't yet linked someone to How To: A Workshop (or anything) and told ... (read more)

I think this is still one of the most comprehensive and clear resources on counterpoints to x-risk arguments. I have referred to this post and pointed people to a number of times. The most useful parts of the post for me were the outline of the basic x-risk case and section A on counterarguments to goal-directedness (this was particularly helpful for my thinking about threat models and understanding agency). 

This post and its companion have even more resonance now that I'm deeper into my graduate education and conducting my research more independently.

Here, the key insight is that research is an iterative process of re-scoping the project and execution on the current version of the plan. You are trying to make a product sufficient to move the conversation forward, not (typically) write the final word on the subject.

What you know, what resources you have access to, your awareness of what people care about, and what there's demand for, depend on your output. Tha... (read more)

As of October, MIRI has shifted its focus. See their announcement for details.

I looked up MIRI's hiring page and it's still in about the same state. This kind of makes sense given the FTX implosion. But I would ask: is MIRI unconcerned with the criticism it received here, and/or does it actively like its approach to hiring? We know Eliezer Yudkowsky, who's on their senior leadership team and board of directors, saw this, because he commented on it.

I found it odd that 3/5 members of the senior leadership team, Malo Bourgon, Alex Vermeer, and Jimmy Rintjema... (read more)

I replicated this review, which you can check out in this colab notebook (I get much higher performance running it locally on my 20-core CPU).

There is only one cluster of discrepancies I found between my analysis and Vaniver's: in my analysis, mating is even more assortative than in the original work (a rough sketch of the check follows the list below):

  • Pearson R of the sum of partner stats is 0.973 instead of the previous 0.857
  • 99.6% of partners have an absolute sum of stats difference < 6, instead of the previous 83.3%.
  • I wasn't completely sure if Vaniver's "net satisfaction" was the difference of self-sat
... (read more)
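
A rough sketch of the assortativity check described in the bullets above, under an assumed data layout (my illustration, not code from the linked notebook):

```python
# Hypothetical sketch: correlate each partner's total stats with the other's.
import numpy as np

def assortativity(stats_a, stats_b):
    """stats_a, stats_b: (n_couples, n_stats) arrays, one row per partner of each couple."""
    sums_a = stats_a.sum(axis=1)
    sums_b = stats_b.sum(axis=1)
    r = np.corrcoef(sums_a, sums_b)[0, 1]          # Pearson R of summed partner stats
    close = np.mean(np.abs(sums_a - sums_b) < 6)   # share of couples within 6 total points
    return r, close
```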

I assign a decent probability to this sequence (of which I think this is the best post) being the most important contribution of 2022. I am however really not confident of that, and I do feel a bit stuck on how to figure out where to apply and how to confirm the validity of ideas in this sequence. 

Despite the abstract nature, I think if there are indeed arguments to do something closer to Kelly betting with one's resources, even in the absence of logarithmic returns to investment, then that would definitely have huge effects on how I think about my ow... (read more)

This post snuck up on me.

The first time I read it, I was underwhelmed.  My reaction was: "well, yeah, duh.  Isn't this all kind of obvious if you've worked with GPTs?  I guess it's nice that someone wrote it down, in case anyone doesn't already know this stuff, but it's not going to shift my own thinking."

But sometimes putting a name to what you "already know" makes a whole world of difference.

Before I read "Simulators," when I'd encounter people who thought of GPT as an agent trying to maximize something, or people who treated MMLU-like one... (read more)

My current view is this post is decent at explaining something which is "2nd type of obvious" in a limited space, using a physics metaphor.  What is there to see is basically given in the title: you can get a nuanced understanding of the relations between deontology, virtue ethics and consequentialism using the frame of "effective theory" originating in physics, and using "bounded rationality" from econ.

There are many other ways how to get this: for example, you can read hundreds of pages of moral philosophy, or do a degree in it.  Advantage of t... (read more)

This post didn't feel particularly important when I first read it.

Yet I notice that I've been acting on the post's advice since reading it. E.g. being more optimistic about drug companies that measure a wide variety of biomarkers.

I wasn't consciously doing that because I updated due to the post. I'm unsure to what extent the post changed me via subconscious influence, versus deriving the ideas independently.

I think it's a bit hard to tell how influential this post has been, though my best guess is "very". It's clear that sometime around when this post was published there was a pretty large shift in the strategies that I and a lot of other people pursued, with "slowing down AI" becoming a much more common goal for people to pursue.

I think (most of) the arguments in this post are good. I also think that when I read an initial draft of this post (around 1.5 years ago or so), and had a very hesitant reaction to the core strategy it proposes, that I was picking up... (read more)

Epistemic status: I read the entire post slowly, taking careful sentence-by-sentence notes. I felt I understood the author's ideas and that something like the general dynamic they describe is real and important. I notice this post is part of a larger conversation, at least on the internet and possibly in person as well, and I'm not reading the linked background posts. I've spent quite a few years reading a substantial portion of LessWrong and LW-adjacent online literature and I used to write regularly for this website.

This post is long and complex. Here ar... (read more)

I wrote a review here. There, I identify the main generators of Christiano's disagreement with Yudkowsky[1] and add some critical commentary. I also frame it in terms of a broader debate in the AI alignment community.

  1. ^

    I divide those into "takeoff speeds", "attitude towards prosaic alignment" and "the metadebate" (the last one is about what kind of debate norms should we have about this or what kind of arguments should we listen to.)

Rich, mostly succinct pedagogy of timeless essentials, highly recommended reference post.

The prose could be a little tighter and less self-conscious in some places. (Things like "I won't go through all of that in this post. There are several online resources that do a good job of explaining it" break the flow of mathematical awe and don't need to be said.)

I'm proud of this post, but it doesn't belong in the Best-of-2022 collection because it's on a niche topic.

This post is one of the best available explanations of what has been wrong with the approach used by Eliezer and people associated with him.

I had a pretty favorable recollection of the post from when I first read it. Rereading it convinced me that I still managed to underestimate it.

In my first pass at reviewing posts from 2022, I had some trouble deciding which post best explained shard theory. Now that I've reread this post during my second pass, I've decided this is the most important shard theory post. Not because it explains shard theory best, but bec... (read more)

I think this post successfully got me to notice this phenomenon when I do it, at least sometimes. 

For me, the canonical example here is "how clean exactly are we supposed to keep the apartment?", where there's just a huge array of how much effort (and how much ambient clutter) is considered normal.

I notice that, while I had previously read both Setting the Default and Choosing the Zero Point, this post seemed to do a very different thing (this is especially weird because it seems like structurally it's making the exact same argument as Setting the Defa... (read more)

I liked this for the idea that fear of scarcity can drive "unreasonable" behaviors. This helps me better understand why others may behave in "undesirable" ways and provides a more productive way of addressing the problem than blaming them for being e.g. selfish. This also provides a more enjoyable way of changing my behaviors. Instead of being annoyed with myself for e.g. being too scared to talk to people, I look out for tiny accomplishments (e.g. speaking up when someone got my order wrong) and the benefits it brings (e.g. getting what I wanted to order), to show myself that I am capable. The more capable I feel, the less afraid I am of the world.

When this post came out, I left a comment saying:

It is not for lack of regulatory ideas that the world has not banned gain-of-function research.

It is not for lack of demonstration of scary gain-of-function capabilities that the world has not banned gain-of-function research.

What exactly is the model by which some AI organization demonstrating AI capabilities will lead to world governments jointly preventing scary AI from being built, in a world which does not actually ban gain-of-function research?

Given how the past year has gone, I should probably lose at... (read more)

Based on occasional conversations with new people, I would not be surprised if a majority of people who got into alignment between April 2022 and April 2023 did so mainly because of this post. Most of them say something like "man, I did not realize how dire the situation looked" or "I thought the MIRI folks were on it or something".

This post points to a rather large update, which I think has not yet propagated through the collective mind of the alignment community. Gains from algorithmic improvement have been roughly comparable to gains from compute and data, and much larger on harder tasks (which are what matter for takeoff).

Yet there's still an implicit assumption behind lots of alignment discussion that progress is mainly driven by compute. This is most obvious in discussions of a training pause: such proposals are almost always about stopping very large runs only. That would stop... (read more)

I think this post was good as something like a first pass.

There's a large and multi-armed dynamic in modern Western liberal society that is a kind of freezing-in-place, as more and more moral weight gets attached to whether or not one is consciously avoiding harm in more and more ways.

For the most part, this is a positive process (and it's at the very least well-intentioned). But it's not as strategic as it could be, and substantially less baby could be thrown out with the bathwater.

This was an attempt to gesture at some baby that, I think, is being thrown... (read more)

I find this post fairly uninteresting, and feel irritated when people confidently make statements about "simulacra." One problem is, on my understanding, that it doesn't really reduce the problem of how LLMs work. "Why did GPT-4 say that thing?" "Because it was simulating someone who was saying that thing." It does postulate some kind of internal gating network which chooses between the different "experts" (simulacra), so it isn't contentless, but... Yeah. 

Also I don't think that LLMs have "hidden internal intelligence", given e.g. LLMs trained on "A i... (read more)

This essay had a significant influence on my growth in the past two years. I shifted from perceiving discomfort as something I am subject to, to considering my relationship with discomfort as an object that can be managed. There are many other writings and experiences that contributed to this growth, but this was the first piece I encountered that talked about managing our relationship with hazards as a thing we can manipulate and improve at. It made me wonder why all human activity may be considered running in the meadow and why contracting may be bad, it... (read more)

This is short, has good object level advice, and points at a useful general lesson.

A lot of LessWrong articles are meta. They're about how to find out about things, or abstract theories about how to make decisions. This article isn't like that. Learning the specific lesson it's trying to teach takes minutes, and might save a life. Not "might save a life" as in "the expected value means somewhere out there in the distant world or distant future the actuarial statistics might be a little different." "Might save a life" as in "that person who, if it comes up,... (read more)

This post was critically important to the core task of solving alignment - or deciding we can't solve it in time and must take low-odds alternative strategies.

Letting Eliezer be the lone voice in the wilderness isn't a good idea. This post and others like it trying to capture his core points in a different voice are crucial.

After going back and forth between this post and the original LoL several times, I think Zvi has captured the core points very well.

This post was, in the end, largely a failed experiment. It did win a lesser prize, and in a sense that proved its point, and I had fun doing it, but I do not think it successfully changed minds, and I don't think it has lasting value, although someone gave it a +9 so it presumably worked for them. The core idea - that EA in particular wants 'criticism' but it wants it in narrow friendly ways and it discourages actual substantive challenges to its core stuff - does seem important. But also this is LW, not EA Forum. If I had to do it over again, I wouldn't bother writing this.

I am flattered that someone nominated this but I don't know why. I still believe in the project, but this doesn't match at all what I'd look to in this kind of review? The vision has changed and narrowed substantially. So this is a historical artifact of sorts, I suppose, but I don't see why it would belong.

I think this post did good work in its moment, but it doesn't have that much lasting relevance, and I can't see why someone would revisit it at this point. It shouldn't be going into any timeless best-of lists.

I continue to frequently refer back to my functional understanding of bounded distrust. I now try to link to 'How To Bounded Distrust' instead because it's more compact, but this is, I think, the better full treatment for those who have the time. I'm sad this isn't seeing more support, presumably because it isn't centrally LW-focused enough? But to me this is a core rationalist skill not discussed enough, among its other features.

That's certainly an interesting position in the discussion about what people want!

Namely, that actions and preferences are just conditionally activated, and those context activations are balanced against each other. That means a person's preference system may be not only incomplete but architecturally incoherent, and that moral systems and goals obtained via reflection are almost certainly not total (they will be undefined in some contexts), which creates problems for RLHF.

The first assumption, that part of neurons is basically randomly initialized, can't be tested really well becau... (read more)

Doesn't say anything particularly novel but presents it in a very clear and elegant manner which has value in and of itself.

This is a great complement to Eliezer's 'List of lethalities', in particular because in cases of disagreement, the beliefs of most people working on the problem were, and still mostly are, closer to this post. Paul writing it provided a clear, well-written reference point, and, with many others expressing their views in comments and other posts, it helped make the beliefs in AI safety more transparent.

I still occasionally reference this post when talking to people who, after reading a bit about the debate e.g. on social media, first form an oversimplified model of the... (read more)

Good post, and relevant in 2023 because of a certain board-related brouhaha.

This is one of the few posts on LW from 2022 that I shared with people completely unrelated to the scene, because it was so fun.

Sometimes posts don't have to be about the most important issues to be good. They can just be good.

"Exfohazard" is a quicker way to say "information that should not be leaked". AI capabilities has progressed on seemingly-trivial breakthroughs, and now we have shorter timelines.

The more people who know and understand the "exfohazard" concept, the safer we are from AI risk.

More framings help the clarity of the discussion. If someone doesn't understand (or agree with) classic AI-takeover scenarios, this is one of the posts I'd use to explain them.

I really liked this post in that it seems to me to have tried quite seriously to engage with a bunch of other people's research, in a way that I feel like is quite rare in the field, and something I would like to see more of. 

One of the key challenges I see for the rationality/AI-Alignment/EA community is the difficulty of somehow building institutions that are not premised on the quality or tractability of their own work. My current best guess is that the field of AI Alignment has made very little progress in the last few years, which is really not w... (read more)

I hadn't seen this post at all until a couple weeks ago. I'd never heard "exfohazard" or similar used. 

Insisting on using a different word seems unnecessary. I see how it can be confusing. I also ran into people confused by this a few years ago, and proposed "cognitohazard" for the "thing that harms the knower" subgenre. That also has not caught on. XD The point is, I'm pro-disambiguating the terms, since they have different implications. But I still believe what I did then, that the original broader meaning of the word "infohazard" is occasionally us... (read more)

I've been thinking about this post a lot since it first came out. Overall, I think its core thesis is wrong, and I've seen a lot of people make confident wrong inferences on the basis of it. 

The core problem with the post was covered by Eliezer's post "GPTs are Predictors, not Imitators" (which was not written, I think, as a direct response, but which still seems to me to convey the core problem with this post):  

Imagine yourself in a box, trying to predict the next word - assign as much probability mass to the next token as possible - for all t

... (read more)

i think about this story from time to time. it speaks to my soul.

  • it is cool that straight-up utopian fiction can have this effect on me.
  • it yanks me into a state of longing. it's as if i lost this world a long time ago, and i'm desperately trying to regain it.

i truly wish everything will be ok :,)

thank you for this, tamsin.

The post explains the difference between two similar-looking positions:

  1. Model gets reward, and attempts to maximize it by selecting appropriate actions.
  2. Model takes some actions in the environment, and a selection/mutation process calculates the reward and figures out how to maximize it by modifying the model.

Those positions differ in the level at which optimization pressure is applied. Either of them can be implemented (so the "reward" value can, in principle, be given to the model as an input), but common RL uses the second option.
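
A minimal sketch of the second position, assuming a toy two-armed bandit and a REINFORCE-style update (my illustration, not from the post): the reward never appears in the policy's forward pass, only in the line where the training loop updates the parameters.

```python
# Hypothetical sketch: the policy only maps parameters -> action probabilities;
# the reward is consumed by the outer training loop, not by the model.
import numpy as np

rng = np.random.default_rng(0)
logits = np.zeros(2)           # policy parameters
true_payoffs = [0.2, 0.8]      # arm 1 pays off more often

def act(logits):
    probs = np.exp(logits) / np.exp(logits).sum()   # softmax policy
    return rng.choice(2, p=probs), probs

for _ in range(2000):
    arm, probs = act(logits)                         # forward pass: reward is not an input
    reward = float(rng.random() < true_payoffs[arm])
    grad = -probs
    grad[arm] += 1.0                                 # gradient of log pi(arm) w.r.t. logits
    logits += 0.1 * reward * grad                    # reward only steers this update

print(act(logits)[1])  # the trained policy now strongly prefers arm 1
```

The first position would instead correspond to feeding the reward back into the model as an observation it tries to maximize; nothing like that happens here.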

The way I understood it, this post is thinking aloud while embarking on the scientific quest of searching for search algorithms in neural networks. It's a way to prepare the ground for doing the actual experiments. 

Imagine a researcher embarking on the quest of "searching for search". I highlight in cursive the parts present in the post (if they are present at least a little):

- At some point, the researcher reads Risks From Learned Optimization.
- They complain: "OK, Hubinger, fine, but you haven't told me what search is anyway"
- They read or get invol... (read more)

I really like this post; it has been very influential on how I think about plans and what to work on. I do think it's a bit vague, though, and lacking a certain kind of general formulation. It might be better if there were more examples listed where the technique could be used.

I like this post! Steven Byrnes and Jacob Cannell are two people with big models of the brain and intelligence that give concrete, unique predictions, and they are large contributors to my own thinking. The post can only be excellent, and indeed it is! Byrnes doesn't always respond to Cannell how I would, but his responses usually shifted my opinion somewhat.

I used to deal with disappointment by minimizing it (e.g. it's not that important) or consoling myself (e.g. we'll do better next time). After reading this piece, I think to myself "disappointment is baby grief". 

Loss is a part of life, whether that is loss of something concrete/"real" or something that we imagined or hoped for. Disappointment is an opportunity to practice dealing with loss, so that I will be ready for the inevitable major losses in the future. I am sad because I did not get what I'd wanted or hoped for, and that is okay.

Clearly a very influential post on a possible path to doom from someone who knows their stuff about deep learning! There are clear criticisms, but it is also one of the best of its era. It was also useful for even just getting a handle on how to think about our path to AGI.

I think this point is incredibly important and quite underrated, and safety researchers often do way dumber work because they don't think about it enough.

(I'm just going to speak for myself here, rather than the other authors, because I don't want to put words in anyone else's mouth. But many of the ideas I describe in this review are due to other people.)

I think this work was a solid intellectual contribution. I think that the metric proposed for how much you've explained a behavior is the most reasonable metric by a pretty large margin.

The core contribution of this paper was to produce negative results about interpretability. This led to us abandoning work on interpretability a few months later, which I'm... (read more)

This is a great reference for the importance and excitement in Interpretability.

I just read this for the first time today. I’m currently learning about Interpretability in hopes I can participate, and this post solidified my understanding of how Interpretability might help.

The whole field of Interpretability is a test of this post. Some of the theories of change won’t pan out. Hopefully many will. Perhaps more theories not listed will be discovered.

One idea I’m surprised wasn’t mentioned is the potential for Interpretability to supercharge all of the scien... (read more)

I really like this paper! This is one of my favourite interpretability papers of 2022, and has substantially influenced my research. I voted at 9 in the annual review. Specific things I like about it:

  • It really started the "narrow distribution" focused interpretability, just examining models on sentences of the form "John and Mary went to the store, John gave a bag to" -> " Mary". IMO this is a promising alternative focus to the "understand what model components mean on the full data distribution" mindset, and worth some real investment in. Model compo
... (read more)

Meta level: I wrote this post in 1-3 hours, and am very satisfied with the returns per unit time! I don't think this is the best or most robust post I could have written, and I think some of these theories of impact are much more important than others. But I think that just collecting a ton of these in the same place was a valuable thing to do, and I have heard from multiple people who appreciated this post's existence! More importantly, it was easy and fun, and I personally want to take this as inspiration to find more, easy-to-write-yet-valuable things to d... (read more)

Self-Review: After a while of being insecure about it, I'm now pretty fucking proud of this paper, and think it's one of the coolest pieces of research I've personally done. (I'm going to both review this post, and the subsequent paper). Though, as discussed below, I think people often overrate it.

Impact: The main impact IMO is proving that mechanistic interpretability is actually possible, that we can take a trained neural network and reverse-engineer non-trivial and unexpected algorithms from it. In particular, I think by focusing on grokking I (semi-acci... (read more)

This post suggests a (and, quite possibly, the) way to select an outcome in bargaining (a fully deterministic multi-player game).

ROSE values replace Competition-Cooperation ones by having players not compete against each other in attempts to extract more utility, but instead average their payoffs over possible initiative orders. A probable social consequence is noted: people wouldn't threaten everyone else in order to get something they want, but would rather maximize their own utility (improve their situation themselves).

ROSE values are resistant to unconditional threats... (read more)

I think this post highlights some of the difficulties in transmitting information between people - particularly the case of trying to transmit complex thoughts via short aphorisms.

I think the comments provided a wealth of feedback as to the various edge cases that can occur in such a transmission, but the broad strokes of the post remain accurate: understanding compressed wisdom can't really be done without the life experience that the wisdom tried to compress to begin with.

If I were to rewrite the post, I'd likely emphasize the takeaway that, when giving a... (read more)

When I think of useful concepts in AI alignment that I frequently refer to, there are a bunch from the olden days (e.g. “instrumental convergence”, “treacherous turn”, …), and a bunch of idiosyncratic ones that I made up myself for my own purposes, and just a few others, one of which is “concept extrapolation”. For example I talk about it here. (Others in that last category include “goal misgeneralization” [here’s how I use the term] (which is related to concept extrapolation) and “inner and outer alignment” [here’s how I use the term].)

So anyway, in the c... (read more)

This post is correct, and the point is important for people who want to use algorithmic information theory.

But, as many commenters noted, this point is well understood among algorithmic information theorists. I was taught this as a basic point in my algorithmic information theory class in university (which tbc was taught by one of the top algorithmic information theorists, so it's possible that it's missed in other treatments).

I'm slightly frustrated that Nate didn't realize that this point is unoriginal. His post seems to take this as an example of a case... (read more)

Since simplicity is such a core heuristic of rationality, having a correct conceptualization of complexity is important.

The usual definition of complexity used on LW (and elsewhere) is Kolmogorov Complexity, which is the length of an object's shortest code. Nate suggests a compelling idea - that another way to be simple is to have many different codes. For example, a state with five 3-bit codes is simpler than a state with one 2-bit code.
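To make the arithmetic behind that example explicit, here is a rough back-of-envelope sketch (my own, using the "sum probability mass over all codes" framing; the code lengths are just the ones from the example above):

```python
import math

state_a_code_lengths = [3, 3, 3, 3, 3]  # five 3-bit codes
state_b_code_lengths = [2]              # one 2-bit code

def k_complexity(code_lengths):
    # Kolmogorov-style complexity: length of the shortest code only.
    return min(code_lengths)

def many_codes_complexity(code_lengths):
    # Alternative: -log2 of the total probability mass contributed by
    # *all* of the state's codes under a 2^-length weighting.
    return -math.log2(sum(2 ** -length for length in code_lengths))

print(k_complexity(state_a_code_lengths))           # 3 bits: looks more complex
print(k_complexity(state_b_code_lengths))           # 2 bits
print(many_codes_complexity(state_a_code_lengths))  # ~0.68 bits: actually simpler
print(many_codes_complexity(state_b_code_lengths))  # 2.0 bits
```

Under the shortest-code definition the five-code state looks more complex; once all codes are allowed to contribute, it comes out simpler, which is the direction of Nate's claim.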

Nate justifies that with a bunch of math, and some physics, which I don't understand well enough to comment on.

A bunch of people wh... (read more)

Hmmmm.

So when I read this post I initially thought it was good. But on second thought I don't think I actually get that much from it. If I had to summarise it, I'd say

  • a few interesting anecdotes about experiments where measurement was misleading or difficult
  • some general talk about "low bit experiments" and how hard it is to control for confounders

The most interesting claim I found was the second law of experiment design. To quote: "The Second Law of Experiment Design: if you measure enough different stuff, you might figure out what you’re actually meas... (read more)

In this post, I appreciated two ideas in particular:

  1. Loss as chisel
  2. Shard Theory

"Loss as chisel" is a reminder of how loss truly does its job, and its implications on what AI systems may actually end up learning. I can't really argue with it and it doesn't sound new to my ear, but it just seems important to keep in mind. Alone, it justifies trying to break out of the inner/outer alignment frame. When I start reasoning in its terms, I more easily appreciate how successful alignment could realistically involve AIs that are neither outer nor inner aligned. In p... (read more)

Probably not the most important thing ever, but this is really pleasing to look at, from the layout to the helpful pictures, which makes it an absolute joy to read.

Also pretty good at explaining Chinchilla scaling too I guess.

I tend to get into pointless internet arguments where both parties end up agreeing, but in a bitter, angry way somehow, so I needed to hear this.

Practically useful, and it expands on and coherently puts into words intuitions I already had. Before I had said intuition, I tended to take a lower-than-optimal amount of risks, so I think this post will be similarly useful to careful people like me.

I like many aspects of this post. 

  • It promotes using intuitions from humans. Using human, social, or biological approaches is neglected compared to approaches that are more abstract and general. It is also scalable, because people who wouldn't be able to work directly on the abstract approaches can work on it.
  • It reflects on a specific problem the author had and offers the same approach to readers.
  • It uses concrete examples to illustrate.
  • It is short and accessible. 

I like the alternative presentation of Logical Induction. I believe Logical Induction to be an important concept and for such concepts making it accessible for different audiences or different cognitive styles is great.

I'm a father of four sons myself, and I'm very happy about this write-up and its lens on observing how a growing child learns. Everything must be learned, and it happens stage by stage. Each stage builds on top of previous stages, and in many cases there are very many small steps that can be observed.

This piece was reasonably well-appreciated (over 100 points) but I nevertheless think of it as one of my most underrated posts, given my sense of how important/crucial the insight is. For me personally, this is one of the largest epiphanies of the past decade, and I think this is easily among the top three most valuable bits of writing I did in 2022. It's the number one essay I go out of my way to promote to the attention of people who already occasionally read my writing, given its usefulness and its relative obscurity.

If I had the chance to write this ov... (read more)

This post summarises and elaborates on Neil Postman's underrated "Amusing Ourselves to Death", about the effects of mediums (especially television) on public discourse. I wrote this post in 2019 and posted it to LessWrong in 2022.

Looking back at it, I continue to think that Postman's book is a valuable and concise contribution and formative to my own thinking on this topic. I'm fond of some of the sharp writing that I managed here (and less fond of other bits).

The broader question here is: "how does civilisation set up public discourse on important topics ... (read more)

Many of the best LessWrong posts give a word and a clear mental handle for something I kinda sorta knew loosely in my head. With the concept firmly in mind, I can use it and build on it deliberately. Sazen is an excellent example of the form.

Sazens are common in many fields I have some expertise in. "Control the centre of the board" in chess. "Footwork is foundational" in martial arts. "Shots on goal" in sports. "Conservation of expected evidence" in rationality. "Premature optimization is the root of all evil" in programming. These sentences are useful remi... (read more)

While the concept that looking at the truth even when it hurts is important isn't revolutionary in the community, I think this post gave me a much more concrete model of the benefits. Sure, I knew about the abstract arguments that facing the truth is valuable, but I don't know if I'd have identified it as an essential skill for starting a company, or as being a critical component of staying in a bad relationship. (I think my model of bad relationships was that people knew leaving was a good idea, but were unable to act on that information—but in retrospect inability to even consider it totally might be what's going on some of the time.)

I don't think this would fit into the 2022 review. Project Lawful has been quite influential, but I find it hard to imagine a way its impact could be included in a best-of.

Including this post in particular strikes me as misguided, as it contains none of the interesting ideas and lessons from Project Lawful, and thus doesn't make any intellectual progress.

One could try to do the distillation of finding particularly interesting or enlightening passages from the text, but that would be

  1. A huge amount of work[1], but maybe David Udell's sequence could be used
... (read more)

This is an important post which I think deserves inclusion in the best-of compilation because, despite its usefulness and widespread agreement about that fact, it seems not to have been highlighted well to the community.

This makes an important point that I find myself consistently referring to - almost none of the confidence in predictions, even inside the rationalist community, is based on actual calibration data. Experts forecast poorly, and we need to stop treating expertise or argumentation as strong stand-alone reasons to accept claims which are implicitly disputed by forecasts.

On the other hand, I think that this post focused far too much on Eliezer. In fact, there are relatively few people in the community who have significant forecasting track records, and this co... (read more)

I did not see this post when it was first put on the forum, but reading it now, my personal view is that it continues a trend of wasting time on a topic that is already a focus of too much effort, with little relevance to actual decisions, and no real new claim that the problems were relevant or worth addressing.

I was even more frustrated that it didn't address most of the specific arguments put forward in our paper from a year earlier on why value for decisionmaking was finite, and then put forward several arguments we explicitly gave reasons ... (read more)

This post, publicly but non-confrontationally rebutting an argument that had been put forward and promoted by others, was a tremendous community service, of a type we see too rarely, albeit far more often in this community than most. It does not engage in strawmanning, it clearly lays out both the original claim and the evidence, and it attempts to engage positively, including trying to find concrete predictions that the disputing party could agree with.

I think this greatly moved community consensus on a moderately important topic in ways that were ver... (read more)

IMO the biggest contribution of this post was popularizing a phrase for the concept of mode collapse, in the context of LLMs and more generally, and serving as an example of a certain flavor of empirical research on LLMs. Other than that it's just a case study whose exact details I don't think are so important.

Edit: This post introduces more useful and generalizable concepts than I remembered when I initially made the review.

To elaborate on what I mean by the value of this post as an example of a certain kind of empirical LLM research: I don't know of much pu... (read more)

I think Simulators mostly says obvious and uncontroversial things, but added to the conversation by pointing them out for those who haven't noticed and introducing words for those who struggle to articulate. IMO people that perceive it as making controversial claims have mostly misunderstood its object-level content, although sometimes they may have correctly hallucinated things that I believe or seriously entertain. Others have complained that it only says obvious things, which I agree with in a way, but seeing as many upvoted it or said they found it ill... (read more)

This post helped me relate to my own work better. I feel less confused about what's going on with the differences between my own working pace and the pace of many around me. I am obviously more like a 10,000 day monk than a 10 day monk, and I should think and plan accordingly. 

Partly because I read this post, I spend fewer resources frantically trying to show off a Marketable Product(TM) as quickly as possible ("How can I make a Unit out of this for the Workshop next month?"), and I spend more resources aiming for the progress I actually think would ... (read more)

After I read this, I started avoiding reading about others' takes on alignment so I could develop my own opinions.

Initially, I sorta felt bummed out that a post-singularity utopia would render my achievements meaningless. After reading this, I started thinking about it more and now I feel less bummed out. Could've done with mentions of biblically accurate angels playing incomprehensible 24d MMOs or omniscient buddhas living in equanimous cosmic bliss, but it still works.

This made me grok Löb's theorem enough to figure out how to exploit it for self-improvement purposes.

I don't have much to say, I just use the word "exfohazard" a lot. Would like a term for things that are not infohazardous in the way the basilisk is alleged to be, but close enough to the latter to cause distress to unprepared readers.

While not a particularly important concept in my mind, it ends up being one of the ones I use the most, competitive with "Moloch", which is pretty impressive.

Motivated me to actually go out and do agent foundations so thumbs up!

This is one of those things that seems totally obvious after reading and makes you wonder how anyone thought otherwise but is somehow non-trivial anyways. 

I still endorse the breakdown of "sharp left turn" claims in this post. Writing this helped me understand the threat model better (or at all) and make it a bit more concrete.

This post could be improved by explicitly relating the claims to the "consensus" threat model summarized in Clarifying AI X-risk. Overall, SLT seems like a special case of that threat model, which makes a subset of the SLT claims: 

  • Claim 1 (capabilities generalize far) and Claim 3 (humans fail to intervene), but not Claims 1a/b (simultaneous / discontinuous generalization) or Claim
... (read more)

I continue to endorse this categorization of threat models and the consensus threat model. I often refer people to this post and use the "SG + GMG → MAPS" framing in my alignment overview talks. I remain uncertain about the likelihood of the deceptive alignment part of the threat model (in particular the requisite level of goal-directedness) arising in the LLM paradigm, relative to other mechanisms for AI risk. 

In terms of adding new threat models to the categorization, the main one that comes to mind is Deep Deceptiveness (let's call it Soares2), whi... (read more)

I'm glad I ran this survey, and I expect the overall agreement distribution probably still holds for the current GDM alignment team (or may have shifted somewhat in the direction of disagreement), though I haven't rerun the survey so I don't really know. Looking back at the "possible implications for our work" section, we are working on basically all of these things. 

Thoughts on some of the cruxes in the post based on last year's developments:

  • Is global cooperation sufficiently difficult that AGI would need to deploy new powerful technology to make it
... (read more)

Sazen turns out to be present everywhere once you see it. I'm noticing it in the news, in teaching, and in learning. I realized it in my FB posts about interesting concepts and on Twitter. I refer to Sazen not only in rationality discourse but have mentioned it to friends and family too. Being aware of Sazen helped me improve my communication. 

I continue to stand by this post.

I believe that in our studies of human cognition, we have relatively neglected the aggressive parts of it. We understand they're there, but they're kind of yucky and unpleasant, so they get relatively little attention. We can and should go into more detail, try to understand, harness and optimize aggression, because it is part of the brains that we're trying to run rationality on.

I am preparing another post to do this in more depth.

I think this post is emblematic of the problem I have with most of Val's writing: there are useful nuggets of insight here and there, but you're meant to swallow them along with a metric ton of typical mind fallacy, projection, confirmation bias, and manipulative narrativemancy.

Elsewhere, Val has written words approximated by ~"I tried for years to fit my words into the shape the rationalists wanted me to, and now I've given up and I'm just going to speak my mind."

This is what it sounds like when you are blind to an important distinction. Trying to hedge m... (read more)

I stand by what I said here: this post has a good goal but the implementation embodies exactly the issue it's trying to fight. 

An early paper that Anthropic then built on to produce their recent exciting results. I found the author's insight and detailed parameter tuning advice helpful.

The author seems aware of all the issues with why RL could be dangerous, but wants to use it anyway, rather than looking for alternatives: frankly it feels confused to me. I found it unhelpful.

Ideally reviews would be done by people who read the posts last year, so they could reflect on how their thinking and actions changed. Unfortunately, I only discovered this post today, so I lack that perspective.

Posts relating to the psychology and mental well-being of LessWrongers are welcome, and I feel like I take a nugget of wisdom from each one (but always fail to import the entirety of the wisdom the author is trying to convey).

 
The nugget from "Here's the exit" that I wish I had read a year ago is "If your body's emergency mobilization sys... (read more)

The post is influential, but makes multiple somewhat confused claims and led many people to become confused. 

The central confusion stems from the fact that genetic evolution already created a lot of control circuitry before inventing the cortex, and did the obvious thing to 'align' the evolutionarily newer areas: bind them to the old circuitry via interoceptive inputs. By this mechanism, the genome is able to 'access' a lot of evolutionarily relevant beliefs and mental models. The trick is that the higher / more genome-distant models are learned in part to predict in... (read more)

I'm glad I did this project and wrote this up. When your goal is to make a thing to make the AI alignment community wiser, it's not really obvious how to tell if you're succeeding, and this was a nice step forward in doing that in a way that "showed my work". That said, it's hard to draw super firm conclusions, because of bias in who takes the survey and some amount of vagueness in the questions. Also, if the survey says a small number of people used a resource and all found it very useful, it's hard to tell if people who chose not to use the resource woul... (read more)

This short post is astounding because it succinctly describes, and prescribes, how to pay attention, to become grounded when a smart and sensitive human could end up engulfed in doom. The post is insightful and helpful to any of us in search of clarity and coping.

This 'medical miracle' story engenders hope. To suffer a chronic illness and to have no tangible answers regarding (diagnosis for too many), treatment, or treatment that works, is disheartening, frustrating and draining. Here is a detailed account of one quest that, having exhausted the standard intellectual-medical approach, remarkably results in relief. This writer gets deeply into her specific experience, and the reminder that there is a case for exploring intuition, along with unlikely luck, is uplifting.  

This is the kind of post I'd love to see more of on LW, so it's with a heavy heart I insist it not go into the review without edits. This post needs to mention interactions with medication much earlier and more prominently. It would be nice if you could count on people not taking suggestions like this until they'd read the entire post and done some of their own research, but you can't, and I think this caveat is important enough and easy enough to explain that it is worth highlighting at the very beginning. 

If you're using it regularly you also need to consider absorption of nutrients, although that's probably not a big deal for most people when used occasionally?

I think this post was a good exercise to clarify my internal model of how I expect the world to look like with strong AI. Obviously, most of the very specific predictions I make are too precise (which was clear at the time of writing) and won't play out exactly like that but the underlying trends still seem plausible to me. For example, I expect some major misuse of powerful AI systems, rampant automation of labor that will displace many people and rob them of a sense of meaning, AI taking over the digital world years before taking over the physical world ... (read more)

I still stand behind most of the disagreements that I presented in this post. There was one prediction that would make timelines longer because I thought compute hardware progress was slower than Moore's law. I now mostly think this argument is wrong because it relies on FP32 precision. However, lower precision formats and tensor cores are the norm in ML, and if you take them into account, compute hardware improvements are faster than Moore's law. We wrote a piece with Epoch on this: https://epochai.org/blog/trends-in-machine-learning-hardware

If anything, ... (read more)

It's been over a year since the original post and 7 months since the openphil revision.

A top level summary:

  1. My estimates for timelines are pretty much the same as they were.
  2. My P(doom) has gone down overall (to about 30%), and the nature of the doom has shifted (misuse, broadly construed, dominates).

And, while I don't think this is the most surprising outcome or the most critical detail, it's probably worth pointing out some context. From NVIDIA:

In two quarters, from Q1 FY24 to Q3 FY24, datacenter revenues went from $4.28B to $14.51B.

From the post:

In 3 year

... (read more)

I think I still mostly stand behind the claims in the post, i.e. nuclear is undervalued in most parts of society but it's not as much of a silver bullet as many people in the rationalist / new liberal bubble would make it seem. It's quite expensive and even with a lot of research and de-regulation, you may not get it cheaper than alternative forms of energy, e.g. renewables. 

One thing that bothered me after the post is that Johannes Ackva (who's arguably a world-leading expert in this field) and Samuel + me just didn't seem to be able to communicate w... (read more)

In a narrow technical sense, this post still seems accurate but in a more general sense, it might have been slightly wrong / misleading. 

In the post, we investigated different measures of FP32 compute growth and found that many of them were slower than Moore's law would predict. This made me personally believe that compute might be growing slower than people thought and most of the progress comes from throwing more money at larger and larger training runs. While most progress comes from investment scaling, I now think the true effective compute growth... (read more)

I haven't talked to that many academics about AI safety over the last year but I talked to more and more lawmakers, journalists, and members of civil society. In general, it feels like people are much more receptive to the arguments about AI safety. Turns out "we're building an entity that is smarter than us but we don't know how to control it" is quite intuitively scary. As you would expect, most people still don't update their actions but more people than anticipated start spreading the message or actually meaningfully update their actions (probably still less than 1 in 10 but better than nothing).

Since this post was written, OpenAI has done much more to communicate its overall approach to safety, making this post somewhat obsolete. At the time, I think it conveyed some useful information, although it was perceived as more defensive than I intended.

My main regret is bringing up the Anthropic split, since I was not able to do justice to the topic. I was trying to communicate that OpenAI maintained its alignment research capacity, but should have made that point without mentioning Anthropic.

Ultimately I think the post was mostly useful for sparking some interesting discussion in the comments.

Since this post was written, I feel like there's been a zeitgeist of "Distillation Projects." I don't know how causal this post was (I think in some sense the ecosystem was ripe for a Distillation Wave), but it seemed useful to think about how that wave played out.

Some of the results have been great. But many of the results have felt kinda meh to me, and I now have a bit of a flinch/ugh reaction when I see a post with "distillation" in its title.

Basically, good distillations are highly skilled efforts. It's sort of natural to write a distillation of... (read more)

[This is a self-review because I see that no one has left a review to move it into the next phase. So8res's comment would also make a great review.]

I'm pretty proud of this post for the level of craftsmanship I was able to put into it. I think it embodies multiple rationalist virtues. It's a kind of "timeless" content, and is a central example of the kind of content people want to see on LW that isn't stuff about AI.

It would also look great printed in a book. :)

I think this post makes a true and important point, a point that I also bring up from time to time.

I do have a complaint though: I think the title (“Deep Learning Systems Are Not Less Interpretable Than Logic/Probability/Etc”) is too strong. (This came up multiple times in the comments.)

In particular, suppose it takes N unlabeled parameters to solve a problem with deep learning, and it takes M unlabeled parameters to solve the same problem with probabilistic programming. And suppose that M<N, or even M<<N, which I think is generally plausible.

If P... (read more)

I found this post extremely clear, thoughtful, and enlightening, and it’s one of my favorite lesswrong (cross)posts of 2022. I have gone back and reread it at least a couple times since it was first posted, and I cited it recently here.

The references to research, clarifying and countering assertions made in a previous piece on sleep, allow for useful knowledge sharing. And the examples of the effects of sleep deprivation are mostly hilarious!

Retrospective: I think this is the most important post I wrote in 2022. I deeply hope that more people benefit by fully integrating these ideas into their worldviews. I think there's a way to "see" this lesson everywhere in alignment: for it to inform your speculation about everything from supervised fine-tuning to reward overoptimization. To see past mistaken assumptions about how learning processes work, and to think for oneself instead. This post represents an invaluable tool in my mental toolbelt.

I wish I had written the key lessons and insights more p... (read more)

Simply by the author stating and exploring examples of 'greyed out options', one is reminded of possible choices, some of which may benefit the reader. Feeling stuck, fettered, having little control or direction, being without meaning or purpose, struggling, or being subject to ennui or varying degrees of stress or anxiety, might be helped by considering physical/psychological, actionable changes in behaviour. Trying new stuff/things/ways of being may be at the edges of one's thought or comfort zone; the reader is gently reminded to look. This writing gives me good pause; it encourages the act of reflecting on personal possibilities, and a subsequent impetus to pursue something novel (and thereby challenging), which may be life enhancing.

This post's point still seems correct, and it still seems important--I refer to it at least once a week.

Origin and summary

This post arose from a feeling in a few conversations that I wasn't being crisp enough or epistemically virtuous enough when discussing the relationship between gradient-based ML methods and natural selection/mutate-and-select methods. Some people would respond like, 'yep, seems good', while others were far less willing to entertain analogies there. Clearly there was some logical uncertainty and room for learning, so I decided to 'math it out' and ended up clarifying a few details about the relationship, while leaving a lot unresolved. E... (read more)
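For flavour, here is a toy sketch of the kind of comparison at stake (my own illustration only, not the post's actual math; the 1-D loss, step size, mutation scale, and population size are arbitrary):

```python
import random

def loss(x):
    return (x - 3.0) ** 2

def grad(x):
    return 2.0 * (x - 3.0)

def gradient_step(x, lr=0.1):
    # One step of plain gradient descent.
    return x - lr * grad(x)

def mutate_and_select_step(x, sigma=0.1, population=50):
    # Sample random mutations around x and keep the best (including no change).
    candidates = [x] + [x + random.gauss(0.0, sigma) for _ in range(population)]
    return min(candidates, key=loss)

random.seed(0)
x = 10.0
print(gradient_step(x))           # moves toward the optimum at 3.0
print(mutate_and_select_step(x))  # also (noisily) moves toward 3.0
```

Both updates tend to move downhill on the loss; pinning down exactly when and how closely the second resembles the first is the sort of detail the post tries to work through.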

An excellent article that gives a lot of insight into LLMs. I consider it a significant piece of deconfusion.

I don't have any substantive comment to provide at the moment, but I want to share that this is the post that piqued my initial interest in alignment. It provided a fascinating conceptual framework around how we can qualitatively describe the behavior of LLMs, and got me thinking about implications of more powerful future models. Although it's possible that I would eventually become interested in alignment, this post (and simulator theory broadly) deserve a large chunk of the credit. Thanks janus.

End-of-2023 author retrospective: 

Yeah, this post holds up. I'm proud of it. The Roman dodecahedron and the fox lady still sit proudly on my desk.

I got the oldest known example wrong, but this was addressed in the sequel post: Who invented knitting? The plot thickens. If you haven't read the sequel, where I go looking for the origins of knitting, you will enjoy it. Yes, even if you're here for the broad ideas about history rather than specifically knitting. (That investigation ate my life for a few months in there. Please read it. 🥺)

I'm extremely ple... (read more)

(This is a review of the entire sequence.)

On the day when I first conceived of this sequence, my room was covered in giant graph paper sticky notes. The walls, the windows, the dressers, the floor. Sticky pads everywhere, and every one of them packed with word clouds and doodles in messy bold marker.

My world is rich. The grain of the wood on the desk in front of me, the slightly raw sensation inside my nostrils that brightens each time I inhale, the pressure of my search for words as I write that rises up through my chest and makes my brain feel like it’s ... (read more)

[This is a review for the whole sequence.]

I think of LessWrong as a place whose primary purpose is and always has been to develop the art of rationality. One issue is that this mission tends to attract a certain kind of person -- intelligent, systematizing, deprioritizing social harmony, etc -- and that can make it harder for other kinds of people to participate in the development of the art of rationality. But rationality is for everyone, and ideally the art would be equally accessible to all.

This sequence has many good traits, but one of the most disting... (read more)

This paper, like others from Anthropic, is exemplary science and exceptional science communication. The authors are clear, precise and thorough. It is evident that their research motivation is to solve a problem, and not to publish a paper, and that their communication motivation is to help others understand, and not to impress.

I think this post paints a somewhat inaccurate view of the past.

The post claims that MIRI's talk of recursive self-improvement from a seed AI came about via MIRI’s attempts to respond to claims such as "AI will never exceed human capabilities" or "Growth rates post AI will be like growth rates beforehand." Thus, the post says, people in MIRI spoke of recursive self-improvement from a seed AI not because they thought this was a particularly likely mainline future -- but because they thought this was one obvious way that AI -- past a certain level of develop... (read more)

i think, in retrospect, this feature was a really great addition to the website.

A nice write-up of something that actually matters, specifically the fact that so many people rely on a general factor of doom to answer questions.

While I think there is reason to have correlation between these factors, and a very weak form of the General Factor of Doom is plausible, I overall like the point that people probably use a General Factor of Doom too much, and the post offers some explanations of why.

Overall, I'd give this a +1 or +4. A reasonably useful post that talks about a small thing pretty competently.

This post demonstrates another surface of the important interplay between our "logical" (really just verbal) part-of-mind and our emotional part-of-mind. Other posts on this site, including by Kaj Sotala and Valentine, go into this interplay and how our rationality is affected by it.

It's important to note, both for ourselves and for our relationships with others, that the emotional part is not something that can be dismissed or fought with, and I think this post does well in explaining an important facet of that. Plus, when we're shown the possible pitfalls ahead of any limerence, we can be more aware of it when we do fall in love, which is always nice.

I think this point is very important, and I refer to it constantly.

I wish that I'd said "the prototypical AI catastrophe is either escaping from the datacenter or getting root access to it" instead (as I noted in a comment a few months ago).

I think this point is really crucial, and I was correct to make it, and it continues to explain a lot of disagreements about AI safety.

This post is very cute. I also reference it all the time to explain the 'inverse cat tax.' You can ask my colleagues, I definitely talk about that model a bunch. So, perhaps strangely, this is my most-referenced post of 2022. 🙃

My explanation of a model tax: this forum (and the EA Forum) really like models, so to get a post to be popular, you gotta put in a model.

I've referenced this post several times. I think the post has to balance being a straw vulcan with being unwilling to forcefully say its thesis, and I find Raemon to be surprisingly good at saying true things within that balance. It's also well-written, and a great length. Candidate for my favorite post of the year.

Comments on the outcomes of the post:

  • I'm reasonably happy with how this post turned out. I think it probably bought the Anthropic/superposition mechanistic interpretability agenda somewhere between 0.1 to 4 counterfactual months of progress, which feels like a win.
  • I think sparse autoencoders are likely to be a pretty central method in mechanistic interpretability work for the foreseeable future (which tbf is not very foreseeable).
  • Two parallel works used the method identified in the post (sparse autoencoders, or SAEs; a minimal toy sketch follows after this list) or a slight modification:
    • Cunningham et al.
... (read more)
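For readers who haven't seen the method, here is a minimal toy sketch of the SAE setup as I understand it (a linear encoder and decoder around a ReLU bottleneck, trained against reconstruction error plus an L1 sparsity penalty; all sizes, initialisations, and coefficients here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden = 64, 512          # overcomplete hidden layer
W_enc = rng.normal(0, 0.02, (d_model, d_hidden))
b_enc = np.zeros(d_hidden)
W_dec = rng.normal(0, 0.02, (d_hidden, d_model))
b_dec = np.zeros(d_model)
l1_coeff = 1e-3                      # strength of the sparsity penalty

def sae_loss(activations):
    # activations: (batch, d_model) activations taken from the model under study.
    features = np.maximum(0.0, activations @ W_enc + b_enc)   # sparse features
    reconstruction = features @ W_dec + b_dec
    reconstruction_error = np.mean((reconstruction - activations) ** 2)
    sparsity_penalty = l1_coeff * np.mean(np.abs(features))
    return reconstruction_error + sparsity_penalty

batch = rng.normal(size=(8, d_model))
print(sae_loss(batch))  # the quantity a real implementation would minimise
```

The hope, as I understand it, is that after training, individual hidden features correspond to interpretable directions in the model's activations.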

So, most people see sleep as something that's obviously beneficial, but this post was great at sparking conversation about this topic, and questioning that assumption about whether sleep is good. It's well-researched and addresses many of the pro-sleep studies and points regarding the issue. 

I'd like to see people do more studies on the effects of low sleep on other diseases or activities. There are many good objections in the comments, such as increased risk of Alzheimer's, driving while sleepy, and how the analogy of sleep deprivation to fasting may b... (read more)

[This comment is no longer endorsed by its author]

This post expresses an important idea in AI alignment that I have essentially believed for a long time, and which I have not seen expressed elsewhere. (I think a substantially better treatment of the idea is possible, but this post is fine, and you get a lot of points for being the only place where an idea is being shared.)

Earlier this year I spent a lot of time trying to understand how to do research better. This post was one of the few resources that actually helped. It described several models that I resonated with, but which I had not read anywhere else. It essentially described a lot of the things I was already doing, and this gave me more confidence in deciding to continue doing full time AI alignment research. (It also helps that Karnofsky is an accomplished researcher, and so his advice has more weight!)

I really like this post because it directly clarified my position on ethics, namely making me abandon unbounded utilities. I want to give this post a Δ and +4 for doing that, and for being clearly written and fairly short.

This post has been surprisingly important to me, and has made me notice how I was confused around what motivation is, conceptually. I've used Steam as a concept maybe once a week, both when introspecting during meditation and when thinking about AI alignment.

I remember three different occasions where I've used steam:

  1. Performing a conceptual analysis of "optimism" in this comment, in which I think I've clarified some of the usage of "optimism", and why I feel frustrated by the word.
  2. When considering whether to undertake a risky and kind of for-me-out-of-di
... (read more)

I just gave this a re-read, I forgot what a trip it is to read the thoughts of Eliezer Yudkowsky. It continues to be some of my favorite stuff in recent years written on LessWrong.

It's hard to relate to the world with a level of mastery over basic ideas as Eliezer has. I don't mean with this to vouch that his perspective is certainly correct, but I believe it is at least possible, and so I think he aspires to a knowledge of reality that I rarely if ever aspire to. Reading it inspires me to really think about how the world works, and really figure out what I know and what I don't. +9

(And the smart people dialoguing with him here are good sports for keeping up their side of the argument.)

I was pleasantly surprised by how many people enjoyed this post about mountain climbing. I never expected it to gain so much traction, since it doesn't relate that clearly to rationality or AI or any of the topics usually discussed on LessWrong.

But when I finished the book it was based on, I just felt an overwhelming urge to tell other people about it. The story was just that insane.

Looking back I think Gwern probably summarized what this story is about best: a world beyond the reach of god. The universe does not respect your desire for a coherent, meaning... (read more)

My review mostly concerns SMTM's A Chemical Hunger part of this review. RaDVaC was interesting if not particularly useful, but SMTM's series has been noted by many commenters to be a strange theory, possibly damaging, and there was, as of my last check, no response by SMTM to the various rebuttals.

It does not behoove rationalism to have members that do not respond to critical looks at their theories. They stand to do a lot of damage and cost a lot of lives if taken seriously.

This post helped me distinguish between having good reasons for my beliefs, and being able to clearly communicate and explain my reasoning, and (to me) painted the latter as pro-social and as a virtue rather than a terrible cost I was expected to pay.

Strong agree, crucial points, +4.

This isn't a post that I feel compelled to tell everyone they need to read, but nonetheless the idea immediately entered my lexicon, and I use it to this day, which is a pretty rare feat. +4

(Sometimes, instead of this phrase, I say that I'm feeling precious about my idea.)

A few points:

  • This post caused me to notice the ways that I find emergencies attractive. I am myself drawn to scenarios where all the moves are forced moves. If there's a fire, there's no uncertainty, no awkwardness, I just do whatever I can right now to put it out. It's like reality is just dragging me along and I don't really have to take responsibility for the rest of it, because all the tradeoffs are easy and any harder evaluations must be tightly time-bounded. I have started to notice the unhealthy ways in which I am drawn to things that have this natu
... (read more)

Guzey substantially retracted this a year later. I think it would be great to publish both together as a case study of self-experimentation, but would be against publishing this on its own. 

Get minimum possible sustainable amount of sleep -> get enough sleep to have maximum energy during the day

Sleep makes me angry. I mean, why on Earth do I have to spend hours every day lying around unconscious?????????

In 2019, trying to learn about the science behind sleep I read Why We Sleep and got so angry at it for being essentially pseudoscience that I spent

... (read more)

This post was actually quite enlightening, and felt more immediately helpful for me in understanding the AI X-risk case than any other single document. I think that's because, while other articles on AI risk can seem kind of abstract, this one considers concretely what kind of organization would be required for navigating around alignment issues, in the mildly inconvenient world where alignment is not solved without some deliberate effort, which put me firmly into near-mode.

This was my favorite non-AI post of 2022; perhaps partly because the story of Toni Kurz serves as a metaphor for the whole human enterprise. Reading about these men's trials was both riveting and sent me into deep reflection about my own values.