I feel like many people look at AI alignment as if the main problem were being careful enough when we train the AI that no bugs cause the objective to misgeneralize.
This is not the main problem. The main problem is that it is likely significantly easier to build an AGI than to build an aligned or corrigible AI. Even if it's relatively obvious that AGI design X destroys the world, and all the wise actors don't deploy it, we cannot prevent unwise actors from deploying it a bit later.
We currently don't have any approach to alignment that would work even if we managed to implement everything correctly and had perfect datasets.
(This is a repost of a comment I wrote a few months ago on John's "My AI Model Delta Compared To Yudkowsky" post. I think points 2-6 (especially 5 and 6) describe important and neglected difficulties of AI alignment.)
My model (which is pretty similar to my model of Eliezer's model) does not match your model of Eliezer's model. Here's my model, and I'd guess that Eliezer's model mostly agrees with it:
(This is an abridged version of my comment here, which I think belongs on my shortform. I removed some examples which were overly long. See the original comment for those.)
Here are some lessons I learned over the last few months from doing alignment research aimed at finding the right ontology for modelling (my) cognition:
Tbc, those are sorta advanced techniques. Most alignment researchers are working on lines of hope that pretty obviously won't work while thinking they have a decent chance of working, and I wouldn't expect those techniques to be of much use for them.
There is this quite foundational skill of "notice when you're not making progress / when your proposals aren't actually good" which is required for further improvement, and I do not know how to teach it. It's related to being very concrete, and to noticing mysterious answers or noticing when you're too abstract or still confused. It might sorta be what Eliezer calls security mindset.
(Also, one other small caveat: I have not yet gotten very clear, great results out of my research, but I do think I am making faster progress (and I'm setting myself a very high standard). I'd guess the lessons can probably be misunderstood and misapplied, but idk.)
In case some people relatively new to LessWrong aren't aware of it (and because I wish I had found this out earlier): "Rationality: From AI to Zombies" does not nearly cover all of the posts Eliezer published between 2006 and 2010.
Here's how it is:
So a sizeable fraction of EY's posts are not in a collection.
I just recently started reading the rest.
I strongly recommend reading:
And generally, a lot of posts on AI (primarily posts from the AI foom debate) are not in the sequences. Some of them were pretty good.