This post is a look back on my first month or so as an active contributor on LessWrong, after lurking for over a decade. My experience so far has been overwhelmingly positive, and one purpose of this post is to encourage other lurkers to start contributing too.

The reason I decided to start posting, in a nutshell:

https://xkcd.com/386/ (xkcd 386, "Duty Calls": "Someone is _wrong_ on the internet")

For the last 10 years or so, I've been following Eliezer's public writing and nodding along in silent agreement with just about everything he says.

I mostly didn't feel like I had much to contribute to the discussion, at least not enough to overcome the activation energy required to post, which for me seems to be pretty high.

However, over the last few years and especially the last few months, I've grown increasingly alarmed and disappointed by the number of highly-upvoted and well-received posts on AI, alignment, and the nature of intelligent systems, which seem fundamentally confused about certain things. These misunderstandings and confusions (as I perceive them) seem especially prominent in posts which reject all or part of the Yudkowskian view of intelligence and alignment.

I notice Eliezer's own views seem to be on the outs with some fraction of prominent posters these days. One hypothesis for this is that Eliezer is actually wrong about a lot of things, and that people are right to treat his ideas with skepticism.

Reading posts and comments from both Eliezer and his skeptics, though, I find this hypothesis unconvincing. Eliezer may sometimes be wrong about important things, but his critics don't seem to be making a very strong case.

(I realize the paragraphs above are potentially controversial. My intent is not to be inflammatory or to attack anyone. My goal in this post is simply to be direct about my own beliefs, without getting too much into the weeds about why I hold them.)

My first few posts and comments have been an attempt to articulate my own understanding of some concepts in AI and alignment which I perceive as widely misunderstood. My goal is to build a foundation from which to poke and prod at some of the Eliezer-skeptical ideas, to see if I have a knack for explaining where others have failed. Or, alternatively, to see if I am the one missing something fundamental, which becomes apparent through more active engagement.

Overview of my recent posts

This section is an overview of my posts so far, ranked by which ones I think are the most worth reading.

Most of my posts assume some background familiarity with, if not agreement with, Yudkowskian ideas about AI and alignment. This makes them less accessible as "101 explanations", but allows me to wade a bit deeper into the weeds without getting bogged down in long introductions.

Steering systems

My longest and most recent post, and the one that I am most proud of.

As of this writing, it has gotten a handful of strong and weak upvotes, and zero downvotes. I'm not sure whether this indicates that it dropped off the front page before it could get more engagement, or that it simply wasn't interesting enough per word for most people in its target audience to read to the end and vote on it.

The main intuition I wanted to convey in this post is how powerful systems might be constructed in the near future by composing "non-agentic" foundation models in relatively simple ways. And further, that there are ways this could lead to extreme danger or failure before we even reach the point of having to worry about still more powerful systems reflecting, deceiving, power-seeking, or exhibiting other, more exotic examples of POUDA.

I'll highlight one quote from this piece, which I think is a nice distillation of a key insight for making accurate predictions about how the immediate future of LLMs is likely to play out:

Training GPT-4 was the work of hundreds of engineers and millions of dollars of computing resources by OpenAI. LangChain is maintained by a very small team. And a single developer can write a python script which glues together chains of OpenAI API calls into a graph. Most of the effort was in training the LLM, but most of the agency (and most of the useful work) comes from the relatively tiny bit of glue code that puts them all together at the end.
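
To make this concrete, below is a minimal sketch of the kind of glue code I have in mind: a short Python script that turns stateless chat-completion calls into a simple propose-act-observe loop. It assumes the 2023-era openai client (openai.ChatCompletion), and run_tool is a hypothetical stand-in for whatever external actions the system is wired up to; treat it as an illustration of the pattern, not a recipe.

```python
import openai  # assumes the 2023-era openai client and an API key in the environment

def ask(messages):
    # One stateless API call; all persistence and "agency" live in the loop below.
    response = openai.ChatCompletion.create(model="gpt-4", messages=messages)
    return response["choices"][0]["message"]["content"]

def run_tool(action):
    # Hypothetical stand-in for executing an action (shell command, web search, etc.).
    return f"(result of executing: {action})"

def pursue_goal(goal, max_steps=5):
    messages = [
        {"role": "system", "content": "Propose one concrete next action toward the goal, or reply DONE."},
        {"role": "user", "content": f"Goal: {goal}"},
    ]
    for _ in range(max_steps):
        action = ask(messages)
        if "DONE" in action:
            break
        observation = run_tool(action)
        messages.append({"role": "assistant", "content": action})
        messages.append({"role": "user", "content": f"Observation: {observation}"})
    return messages
```

The point is not that this particular loop is interesting in itself, but that the marginal effort of going from "stateless API" to "goal-pursuing system" is tiny compared to the effort of training the underlying model.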

Gradual takeoff, fast failure

My first post, and the precursor for "Steering systems". Looking back, I don't think there's much here that's novel or interesting, but it's a briefer introduction to some of the ways I think about things in "Steering systems".

The post is about some ways I see potential for catastrophic failure arising before the failure modes associated with the kinds of systems that MIRI and other hard-takeoff-oriented research groups tend to focus on. Even if we somehow make it past those earlier failure modes, though, I think we'll still end up facing the harder problems of hard takeoff.

Grinding slimes in the dungeon of AI alignment research

This post attempts to articulate a metaphor for the ways that different kinds of alignment research might contribute to increasing or decreasing x-risk.

I still like this post, but looking back, I think I should have explained the metaphor in more detail, for people who aren't familiar with RPGs. Also, "grinding in the slime dungeons" might have been perceived as negative or dismissive of alignment research focused on current AI systems, which I didn't intend. I do think we are in the "early game" of AI systems and alignment, and slimes are a common early-game enemy in RPGs. That was the extent of the point I was trying to make with that part of the analogy.

Instantiating an agent with GPT-4 and text-davinci-003

This was mostly just my own fun attempt at experimenting with GPT-4 when I first got access. Others have done similar, more impressive things, but doing the experiment and writing the post gave me a better intuitive understanding of GPT-4's capabilities and the potential ways that LLMs can be arranged and composed into more complex systems. I think constructions like the one in this Twitter thread demonstrate the point I was trying to make in a more concrete and realistic way.
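
As a rough illustration of what I mean by "arranged and composed" (this is not the exact construction from my post, just the general pattern), one model can be prompted as a planner and another as an executor. The sketch below assumes the 2023-era openai client, where text-davinci-003 was served through the Completion endpoint.

```python
import openai  # assumes the 2023-era openai client and an API key in the environment

def plan_with_gpt4(task):
    # GPT-4 acts as the planner: break the task into short, numbered instructions.
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Break the task into three short, numbered instructions."},
            {"role": "user", "content": task},
        ],
    )
    return response["choices"][0]["message"]["content"]

def execute_with_davinci(instruction):
    # text-davinci-003 acts as the executor for a single instruction.
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=f"Carry out the following instruction and report the result:\n{instruction}\n",
        max_tokens=256,
    )
    return response["choices"][0]["text"].strip()

def run(task):
    plan = plan_with_gpt4(task)
    # Naive parsing: treat each non-empty line of the plan as one instruction.
    return [execute_with_davinci(line) for line in plan.splitlines() if line.strip()]
```

Even this toy version shows how quickly a "more complex system" falls out of a few dozen lines of glue, once capable models are available behind an API.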

Takeaways and observations

  • Writing is hard, writing well is harder. I have a much greater appreciation for prolific writers who manage to produce high quality, coherent, and insightful posts on a regular basis, whether I agree with their conclusions or not.
  • Engaging with and responding to critical and differing views is also hard. Whether someone responds to a particular commenter, or even to a highly-upvoted post with differing views, seems like very little evidence about whether their own ideas are valid and correct, and more evidence about how much energy and time they have to engage.
  • The posts I'm most proud of are not the ones that got the most karma. Most of my karma comes from throwaway comments on popular linkposts, and my own most highly-upvoted submission is a podcast link.

    I don't think this is a major problem - I'm not here to farm karma or maximize engagement, and my higher-effort posts and comments tend to have a smaller target audience.

    More broadly, I don't think the flood of high-engagement but less technically deep posts on LW is crowding out more substantive posts (either my own or others') in a meaningful way. (Credit to the LW development team for building an excellent browsing UX.)

    I do think there is a flood of more substantive posts that do crowd each other out, to some degree - I spend a fair amount of time reading and voting on more substantive new submissions, and still feel like there's a lot of good stuff that I'm missing due to time constraints.
  • I encourage other longtime lurkers to consider becoming active. Even if you initially get low engagement or downvotes, as long as you understand and respect the norms of the community, your participation will be welcome. My experience so far has been overwhelmingly positive, and I wish I had started sooner.
  • The "Get feedback" feature exists and is great. I didn't use it for this post, but Justis from the LW moderation team gave me some great feedback on Steering systems, which I think made the post stronger. 

Miscellaneous concluding points

  • I welcome any feedback or engagement with my existing posts, even if it's not particularly constructive. Also welcome are any ideas for future posts or pieces to comment on, though I have many of my own ideas already.
  • I realize that I made some controversial claims in the intro, and left them totally unsupported. Again, my intent is not to be inflammatory; the point here is just to stake out my own beliefs as concisely and clearly as possible.

    Object-level discourse on these claims about AI alignment and differing viewpoints is welcome in the comments of this post, though I might not engage immediately (or at all) if the volume is high, or even if it isn't.
  • Despite my somewhat harsh words, I still think LW is the best place on the internet for rationality and sane discourse on AI and alignment (and many other topics), and nowhere else comes close.
  • My real-life identity is not secret, though I'd prefer for now that my LW postings not be attached to my full name when people Google me for other reasons. PM me here or on Discord (m4xed#7691) if you want to know who I am. (Despite my inclination to lurk online, I've been a longtime active participant in the meatspace rationality community in NYC. 👋)
Comments (4)

Although I soft upvoted this post, there are some notions I'm uncomfortable with. 

What I agree with:

  • Longtime lurkers should post more
  • Less technical posts are pushing more technical posts out of the limelight
  • Posts that dispute the Yudkowskian alignment paradigm are more likely to contain incorrect information (not directly stated, but heavily implied, I believe; please correct me if I've misinterpreted)
  • Karma is not an indicator of correctness or of value

The third point is likely due to the fact that the Yudkowskian alignment paradigm isn't a particularly fun one. It is easy to dismiss great ideas in favor of other great ideas when the other ideas promise lower x-risk. This applies in both directions, however, as it's far easier to succumb to extreme views (I don't mean to use this term in a diminishing fashion) like "we are all going to absolutely die" or "this clever scheme will reduce our x-risk to 1%" and miss the antimeme hiding in plain sight. A perfect example of this, in my mind, is the comment section of the Death with Dignity post.

I worry that posts like this discourage content that does not align with the Yudkowskian paradigm, which is likely just as important as content that conforms to it. I don't find ideas like Shard Theory, or their consequent positive reception, alarming or disappointing; on the contrary, I find their presentation meaningful and valuable, regardless of whether or not they are correct (this is not meant to imply that I think Shard Theory is incorrect; it was merely an example). The alternative to posting potentially incorrect ideas (a category that encompasses most ideas) is to have them never scrutinized, improved upon, or falsified. Furthermore, incorrect ideas and their falsification can still greatly enrich the field of alignment, and there is no reason why an incorrect interpretation of agency, for example, couldn't still produce valuable alignment insights. Whilst we likely cannot iterate on aligning AGI itself, alignment ideas are an area in which iteration can be applied, and we would be fools not to apply such a powerful tool broadly. Ignoring the blunt argument of "maybe Yudkowsky is wrong", it seems evident that "non-Yudkowskian" ideas (even incorrect ones) should be a central component of LessWrong's published alignment research; this seems to me the fastest path toward being predictably wrong less often.

To rephrase: is it the positive reception of non-Yudkowskian ideas that alarms/disappoints you, or the positive reception of ideas you believe have a high likelihood of being incorrect (which happens to correlate positively with non-Yudkowskian ideas)?

I assume your answer will be the latter, and if so, I don't think the correct point to press is whether or not ideas conform to the views associated with a specific person, but rather whether they are likely to be false. Let me know what you think, as I share most of your concerns.

Mmm, my intent is not to discourage people from posting views I disagree with, and I don't think this post will have that effect.

It's more like, I see a lot of posts that could be improved by grappling more directly with Yudkowskian ideas. To the credit of many of the authors I link, they often do this, though not always as much as I'd like or in ways I think are correct.

The part I find lacking in the discourse is pushback from others, which is what I'm hoping to change. That pushback can't happen if people don't make the posts in the first place!

I've grown increasingly alarmed and disappointed by the number of highly-upvoted and well-received posts on AI, alignment, and the nature of intelligent systems, which seem fundamentally confused about certain things.

Can you elaborate on how all these linked pieces are "fundamentally confused"? I'd like to see a detailed list of your objections. It's probably best to make a separate post for each one.

I think commenting is a more constructive way of engaging in many cases. Before and since publishing this post, I've commented on some of the pieces I linked (or related posts or subthreads).

I've also made one top-level post which is partially an objection to the characterization of alignment that I think is somewhat common among many of the authors I linked. Some of these threads have resulted in productive dialogue and clarity, at least from my perspective. 

Links:

There are probably some others in my comment history. Most of these aren't fundamental objections to the pieces they respond to, but they gesture at the kind of thing I am pointing to in this post. 

If I had to summarize (without argument) the main confusions as I see them:

  • An implicit or explicit assumption that near-future intelligent systems will look like current DL-paradigm research artifacts. (This is partially what this post is addressing.)
  • I think a lot of people mostly accept orthogonality and instrumental convergence, without following the reasoning through or engaging directly with all of the conclusions they imply.  I think this leads to a view that explanations of human value formation or arguments based on precise formulations of coherence have more to say about near-future intelligent systems than is actually justified. Or at least, that results and commentary about these things are directly relevant as objections to arguments for danger based on consequentialism and goal-directedness more generally. (I haven't expanded on this in a top-level post yet, but it is addressed obliquely by some of the comments and posts in my history.)