
I feel like many people look at AI alignment as if the main problem were being careful enough when training the AI that no bugs cause the objective to misgeneralize.

This is not the main problem. The main problem is that it is likely significantly easier to build an AGI than to build an aligned AI or a corrigible AI. Even if it's relatively obvious that AGI design X destroys the world, and all the wise actors refrain from deploying it, we cannot prevent unwise actors from deploying it a bit later.

We currently don't have any approach to alignment that would work even if we managed to implement everything correctly and had perfect datasets.

(This is a repost of my comment on John's "My AI Model Delta Compared To Yudkowsky" post which I wrote a few months ago. I think points 2-6 (especially 5 and 6) describe important and neglected difficulties of AI alignment.)

My model (which is pretty similar to my model of Eliezer's model) does not match your model of Eliezer's model. Here's my model, and I'd guess that Eliezer's model mostly agrees with it:

  1. Natural abstractions (very) likely exist in some sense. Concepts like "chair" and "temperature" and "carbon" and "covalent bond" all seem natural in some sense, and an AI might model them too (though at significantly superhuman levels of intelligence it may instead use different concepts/models). (It's also not quite as clear whether such natural abstractions apply very well to giant transformers; they probably do in some sense IMO, but it's perhaps hard to identify them and to interpret what "concepts" actually are in AIs.)
  2. Many things we value are not natural abstractions, but only natural relative to a human mind design. Emotions like "awe" or "laughter" are quite complex things shaped by evolution, and perhaps minds that have emotions at all occupy only a small region of mind design space. The AI doesn't have built-in machinery for modelling other humans the way humans model other humans. It might eventually form abstractions for the emotions, but probably not in a way where it understands how the emotion feels from the inside.
    1. There is lots of hidden complexity in what determines human values. Trying to point an AI to human values directly (in a way similar to how humans are pointed to their values) would be incredibly complex. Specifying a CEV process / modelling one or multiple humans, identifying where in the model the values are represented, and pointing the AI to optimize those values is more tractable, but would still require a vastly greater mastery of understanding minds to pull off, and we are not on a path to get there without human augmentation.
  3. When the AI is smarter than us it will have better models which we don't understand, and the concepts it uses will diverge from the concepts we use. As an analogy, consider 19th-century humans (or people who don't know much about medicine) being able to vaguely classify health symptoms into diseases, vs the AI having a gears-level model of the body and the immune system which explains the observed symptoms.
  4. I think a large part of what Eliezer meant with Lethalities#33 is that the way thinking works deep in your mind looks very different from the English sentences you can notice going through your mind, which are only shallow shadows of the actual thinking going on; and for giant transformers, the way the actual thinking looks is likely even less understandable than the way the actual thinking looks in humans.
  5. Ontology identification (including utility rebinding) is not nearly all of the difficulty of the alignment problem (except possibly insofar as figuring out all the (almost-)ideal frames to model and construct AI cognition is a prerequisite to solving ontology identification). Other difficulties include:
    1. We won't get a retargetable general purpose search by default, but rather the AI is (by default) going to be a mess of lots of patched-together optimization patterns.
    2. There are lots of things that might cause goal drift: misaligned mesa-optimizers which try to steer or get control of the AI; Goodhart; the AI might just not be smart enough initially and make mistakes which cause irrevocable value drift; and in general it's hard to train the AI to become smarter / train better optimization algorithms while keeping the goal constant.
    3. (Corrigibility.)
  6. While it's nice that John is attacking ontology identification, he doesn't seem nearly as on track to solve it in time as he seems to think. Specifying a goal in the AI's ontology requires finding the right frames for modelling how an AI imagines possible worldstates, which will likely look very different from how we initially naively think of it (e.g. the worldstates won't be modelled by English-language sentences or anything remotely as interpretable). The way we currently think of what "concepts" are might not naturally bind to anything in how the AI's reasoning looks, and we first need to find the right way to model AI cognition and then try to interpret what the AI is imagining. Even if "concept" is a natural abstraction over AI cognition, and we were able to identify concepts (though it's not that easy to concretely imagine how that might look for giant transformers), we'd still need to figure out how to combine concepts into worldstates so we can then specify a utility function over those.

(This is an abridged version of my comment here, which I think belongs on my shortform. I removed some examples which were overly long. See the original comment for those.)

Here are some lessons I learned over the last months from doing alignment research aimed at finding the right ontology for modelling (my) cognition:

  • make examples: if you have an abstract goal or abstract hypothesis/belief/model/plan, clarify on an example what it predicts.
    • e.g. given thought "i might want to see why some thoughts are generated" -> what does that mean more concretely? -> more concrete subcases:
      • could mean noticing a common cognitive strategy
      • could mean noticing some suggestive concept similarity
      • maybe other stuff like causal inference (-> notice i'm not that clear on what i mean by that -> clarify and try to come up with an example):
        • e.g. "i imagine hiking a longer path" -> "i imagine missing the call i have in the evening"
    • (yes it's often annoying and not easy, especially in the beginning)
    • (if you can't, you're still confused.)
  • generally be very concrete. also Taboo your words and Replace the Symbol with the Substance.
  • I want to highlight the "what is my goal" part
    • also ask "why do i want to achieve the goal?"
      • (-> minimize goodhart)
    • clarify your goal as much as possible.
      • (again Taboo your words...)
      • clarify your goal on examples
        • when your goal is to understand something, how will you be able to apply the understanding to a particular example?
  • try to extract the core subproblems/subgoals.
    • e.g. for corrigibility a core subproblem is the shutdown problem (where further more precise subproblems could be extracted.)
    • i guess make sure you think concretely and list subproblems and summarize the core ones and iterate. follow up on confusions where problems still seem sorta mixed up. let your mind find the natural clusters. (not sure if that will be sufficient for you.)
  • tie yourself closely to observations.
  • drop all assumptions. apply generalized hold off on proposing solutions.
    • in particular, try not to make implicit, non-well-founded assumptions about what the ontology looks like, e.g. by asking questions like "how can i formalize concepts" or "what are thoughts". just see the observations as directly as possible and try to form a model of the underlying process that generates them.
  • first form a model about concrete narrow cases and only later generalize
    • e.g. first study precisely what thought chains you had on particular combinatorics problems before hypothesizing what kinds of general strategies your mind uses.
    • special case: (first) plan how to solve specific research subproblems rather than trying to come up with good general methodology for the kinds of problems you are attacking.
  • don't overplan and rather try stuff and review how it's going and replan and iterate.
    • this is sorta an application of "get concrete" where you get concrete by actually trying the thing rather than imagining how it will look if you attack it.
  • often review how you made progress and see how to improve.
  • (also generally lots of other lessons from the sequences (and HPMoR): notice confusion, notice mysterious answers, know what an actual reduction looks like, and probably a whole bunch more)

Tbc those are sorta advanced techniques. Most alignment researchers are working on lines of hope that pretty obviously won't work while thinking they have a decent chance of working, and I wouldn't expect those techniques to be of much use for them.
There is this quite foundational skill of "notice when you're not making progress / when your proposals aren't actually good" which is required for further improvement, and I do not know how to teach it. It's related to being very concrete and to noticing mysterious answers, or noticing when you're too abstract or still confused. It might sorta be what Eliezer calls security mindset.

(Also, another small caveat: I have not yet gotten very clear, great results out of my research, but I do think I am making faster progress (and I'm setting myself a very high standard). I'd guess the lessons can probably be misunderstood and misapplied, but idk.)

In case some people relatively new to LessWrong aren't aware of it (and because I wish I had found this out earlier): "Rationality: From AI to Zombies" does not nearly cover all of the posts Eliezer published between 2006 and 2010.

Here's how it is:

  • "Rationality: From AI to Zombies" probably contains like 60% of the words EY has written in that timeframe and the most important rationality content.
  • The original sequences are basically the old version of the collection that is now "Rationality: A-Z", containing a bit more content, in particular a longer quantum physics sequence and sequences on fun theory and metaethics.
  • All EY posts from that timeframe (or here for all EY posts until 2020, I guess); these can also be found on LessWrong, but not in any collection, I think.

So a sizeable fraction of EY's posts are not in a collection.

I just recently started reading the rest.

I strongly recommend reading:

And generally, a lot of posts on AI (primarily posts in the AI foom debate) are not in the sequences. Some of them were pretty good.