Although I weak-upvoted this post, there are some notions I'm uncomfortable with.
What I agree with:
The third point is likely due to the fact that the Yudkowskian alignment paradigm isn't a particularly fun one. It is easy to dismiss great ideas in favor of other great ideas when the latter promise lower x-risk. This cuts in both directions, however: it's far easier to succumb to extreme views (I don't mean the term dismissively) like "we are all absolutely going to die" or "this clever scheme will reduce our x-risk to 1%" and miss the antimeme hiding in plain sight. A perfect example of this, in my mind, is the comment section of the Death with Dignity post.
I worry that posts like this discourage content that does not align with the Yudkowskian paradigm, content which is likely just as important as posts that conform to it. I don't find ideas like Shard Theory, or their consequent positive reception, alarming or disappointing; on the contrary, I find their presentation meaningful and valuable regardless of whether they are correct (this isn't meant to imply I think Shard Theory is incorrect; it is merely an example).

The alternative to posting potentially incorrect ideas (a category that encompasses most ideas) is to have them never scrutinized, improved upon, or falsified. Furthermore, incorrect ideas and their falsification can still greatly enrich the field of alignment; there is no reason why an incorrect interpretation of agency, for example, couldn't still produce valuable alignment insights. While we likely cannot iterate on aligning AGI itself, alignment ideas are an area where iteration can be applied, and we would be fools not to apply such a powerful tool broadly. Setting aside the blunt argument of "maybe Yudkowsky is wrong", it seems evident that "non-Yudkowskian" ideas (even incorrect ones) should be a central component of LessWrong's published alignment research; this seems to me the fastest path toward being predictably wrong less often.
To rephrase: is it the positive reception of non-Yudkowskian ideas that alarms/disappoints you, or the positive reception of ideas you believe have a high likelihood of being incorrect (a property which happens to correlate with being non-Yudkowskian)?
I assume your answer will be the latter, and if so, I don't think the right point to press is whether ideas conform to the views associated with a specific person, but rather whether they are likely to be false. Let me know what you think, as I share most of your concerns.
Mmm, my intent is not to discourage people from posting views I disagree with, and I don't think this post will have that effect.
It's more like, I see a lot of posts that could be improved by grappling more directly with Yudkowskian ideas. To the credit of many of the authors I link, they often do this, though not always as much as I'd like or in ways I think are correct.
The part I find lacking in the discourse is pushback from others, which is what I'm hoping to change. That pushback can't happen if people don't make the posts in the first place!
I've grown increasingly alarmed and disappointed by the number of highly-upvoted and well-received posts on AI, alignment, and the nature of intelligent systems, which seem fundamentally confused about certain things.
Can you elaborate on how all these linked pieces are "fundamentally confused"? I'd like to see a detailed list of your objections. It's probably best to make a separate post for each one.
I think commenting is a more constructive way of engaging in many cases. Before and since publishing this post, I've commented on some of the pieces I linked (or related posts or subthreads).
I've also made one top-level post which is partially an objection to the characterization of alignment that I think is somewhat common among many of the authors I linked. Some of these threads have resulted in productive dialogue and clarity, at least from my perspective.
Links:
There are probably some others in my comment history. Most of these aren't fundamental objections to the pieces they respond to, but they gesture at the kind of thing I am pointing to in this post.
If I had to summarize (without argument) the main confusions as I see them:
This post is a look back on my first month or so as an active contributor on LessWrong, after lurking for over a decade. My experience so far has been overwhelmingly positive, and one purpose of this post is to encourage other lurkers to start contributing as well.
The reason I decided to start posting, in a nutshell:
For the last 10 years or so, I've been following Eliezer's public writing and nodding along in silent agreement with just about everything he says.
I mostly didn't feel like I had much to contribute to the discussion, at least not enough to overcome the activation energy required to post, which for me seems to be pretty high.
However, over the last few years and especially the last few months, I've grown increasingly alarmed and disappointed by the number of highly-upvoted and well-received posts on AI, alignment, and the nature of intelligent systems, which seem fundamentally confused about certain things. I think these misunderstandings and confusions (as I perceive them) are especially prominent in posts which reject all or part of the Yudkowskian view of intelligence and alignment.
I notice Eliezer's own views seem to be on the outs with some fraction of prominent posters these days. One hypothesis for this is that Eliezer is actually wrong about a lot of things, and that people are right to treat his ideas with skepticism.
Reading posts and comments from both Eliezer and his skeptics, though, I find this hypothesis unconvincing. Eliezer may sometimes be wrong about important things, but his critics don't seem to be making a very strong case.
(I realize the paragraphs above are potentially controversial. My intent is not to be inflammatory or to attack anyone. My goal in this post is simply to be direct about my own beliefs, without getting too much into the weeds about why I hold them.)
My first few posts and comments have been an attempt to articulate my own understanding of some concepts in AI and alignment which I perceive as widely misunderstood. My goal is to build a foundation from which to poke and prod at some of the Eliezer-skeptical ideas, to see if I have a knack for explaining where others have failed. Or, alternatively, to see if I am the one missing something fundamental, which becomes apparent through more active engagement.
Overview of my recent posts
This section is an overview of my posts so far, ranked by which ones I think are the most worth reading.
Most of my posts assume some background familiarity with, if not agreement with, Yudkowskian ideas about AI and alignment. This makes them less accessible as "101 explanations", but allows me to wade a bit deeper into the weeds without getting bogged down in long introductions.
Steering systems
My longest and most recent post, and the one that I am most proud of.
As of publishing this piece, it has gotten a handful of strong and weak upvotes, and zero downvotes. I'm not sure if this indicates it dropped off the front page before it could get more engagement, or if it was simply not interesting enough per word for most people in its target audience to read to the end and vote on it.
The main intuition I wanted to convey in this post is how powerful systems might be constructed in the near future by composing "non-agentic" foundation models in relatively simple ways, and further, that this can lead to extreme danger or failure even before we get to the point of having to worry about more powerful systems reflecting, deceiving, power-seeking, or exhibiting other, more exotic examples of POUDA.
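To make the composition idea a bit more concrete, here is a minimal sketch of the general pattern, not the specific construction from the post: a plain completion model, treated as a non-agentic component, is wrapped in a simple loop that parses its output into actions, executes them, and feeds the results back in. The `complete` and `run_tool` functions and the action format are hypothetical placeholders, not any real API.

```python
def complete(prompt: str) -> str:
    """Placeholder for a call to some text-completion model (hypothetical)."""
    raise NotImplementedError

def run_tool(name: str, arg: str) -> str:
    """Placeholder for executing a tool, e.g. search or a calculator (hypothetical)."""
    raise NotImplementedError

def steering_loop(goal: str, max_steps: int = 10) -> str:
    """Wrap a 'non-agentic' completion model in a simple act-observe loop."""
    transcript = f"Goal: {goal}\n"
    for _ in range(max_steps):
        # Ask the model for its next action in a fixed, parseable format.
        output = complete(
            transcript + "Next action (TOOL:<name> <arg> or DONE:<answer>): "
        ).strip()
        if output.startswith("DONE:"):
            return output[len("DONE:"):].strip()
        if output.startswith("TOOL:"):
            name, _, arg = output[len("TOOL:"):].strip().partition(" ")
            # Execute the requested tool and append the observation to the transcript.
            transcript += f"Action: {output}\nObservation: {run_tool(name, arg)}\n"
        else:
            transcript += f"Model said: {output}\n"
    return "No answer within the step budget."
```

The point of the sketch is only that the loop itself contains no cleverness; whatever goal-directed behavior the composite system exhibits comes from the model plus a few lines of glue code.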
I'll highlight one quote from this piece, which I think is a nice distillation of a key insight for making accurate predictions about how the immediate future of LLMs is likely to play out:
Gradual takeoff, fast failure
My first post, and the precursor to "Steering systems". Looking back, I don't think there's much here that's novel or interesting, but it's a briefer introduction to some of the ways I think about things in "Steering systems".
The post is about some ways I see potential for catastrophic failure arising before the failure modes that come with the kinds of systems MIRI and other hard-takeoff-focused research groups tend to concentrate on. If we somehow make it past those earlier failure modes, though, I think we'll still end up facing the harder problems of hard takeoff.
Grinding slimes in the dungeon of AI alignment research
This post attempts to articulate a metaphor for the different ways different kinds of alignment research might contribute to increasing or decreasing x-risk.
I still like this post, but looking back, I think I should have explained the metaphor in more detail, for people who aren't familiar with RPGs. Also, "grinding in the slime dungeons" might have been perceived as negative or dismissive of alignment research focused on current AI systems, which I didn't intend. I do think we are in the "early game" of AI systems and alignment, and slimes are a common early-game enemy in RPGs. That was the extent of the point I was trying to make with that part of the analogy.
Instantiating an agent with GPT-4 and text-davinci-003
This was mostly just my own fun attempt at experimenting with GPT-4 when I first got access. Others have done similar, more impressive things, but doing the experiment and writing the post gave me a better intuitive understanding of GPT-4's capabilities and the potential ways that LLMs can be arranged and composed into more complex systems. I think constructions like the one in this Twitter thread demonstrate the point I was trying to make in a more concrete and realistic way.
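For readers who want a concrete picture of what "arranging and composing" LLMs can look like, here is a toy illustration, not the exact construction from that post or the linked thread: one model proposes the next step toward a goal, and another drafts the text for that step. The prompts and control flow are my own illustrative assumptions, and the snippet assumes the pre-v1 `openai` Python client with an API key set in the environment.

```python
import openai  # pre-v1 client; reads OPENAI_API_KEY from the environment

def propose_step(goal: str, history: str) -> str:
    """Ask GPT-4 (chat model) for the single next step toward the goal."""
    resp = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": f"Goal: {goal}\nProgress so far:\n{history}\n"
                       "State the single next step, or 'DONE' if finished.",
        }],
    )
    return resp.choices[0].message.content.strip()

def execute_step(step: str) -> str:
    """Have text-davinci-003 (completion model) carry out the step as text."""
    resp = openai.Completion.create(
        model="text-davinci-003",
        prompt=f"Carry out this instruction and report the result:\n{step}\n",
        max_tokens=256,
        temperature=0,
    )
    return resp.choices[0].text.strip()

def run(goal: str, max_steps: int = 5) -> str:
    """Alternate between the two models until one declares the goal done."""
    history = ""
    for _ in range(max_steps):
        step = propose_step(goal, history)
        if step.upper().startswith("DONE"):
            break
        history += f"- {step}\n  result: {execute_step(step)}\n"
    return history
```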
Takeaways and observations
I don't think this is a major problem - I'm not here to farm karma or maximize engagement, and my higher-effort posts and comments tend to have a smaller target audience.
More broadly, I don't think the flood of high-engagement but less technically deep posts on LW is crowding out more substantive posts (my own or others') in a meaningful way. (Credit to the LW development team for building an excellent browsing UX.)
I do think the flood of more substantive posts crowds itself out to some degree - I spend a fair amount of time reading and voting on new substantive submissions, and still feel like there's a lot of good stuff I'm missing due to time constraints.
Miscellaneous concluding points
Object-level discussion of these claims about AI alignment, and of differing viewpoints, is welcome in the comments of this post, though I might not engage immediately (or at all) if the volume is high, or even if it isn't.