What alignment-relevant abilities might Terence Tao lack?

Towards_Keeperhood

40. "Geniuses" with nice legible accomplishments in fields with tight feedback loops where it's easy to determine which results are good or bad right away, and so validate that this person is a genius, are (a) people who might not be able to do equally great work away from tight feedback loops, (b) people who chose a field where their genius would be nicely legible even if that maybe wasn't the place where humanity most needed a genius, and (c) probably don't have the mysterious gears simply because they're rare. You cannot just pay $5 million apiece to a bunch of legible geniuses from other fields and expect to get great alignment work out of them. They probably do not know where the real difficulties are, they probably do not understand what needs to be done, they cannot tell the difference between good and bad work, [...]

-Eliezer in AGI ruin

This question is about the capabilities that are needed for alignment research in the worlds where alignment is hard, so we need to solve alignment very robustly, for which the easiest path to success likely involves creating a new AGI paradigm where alignment is more feasible.

My guess is that Eliezer is likely right about that we cannot just pay a young supergenius to work on alignment and expect useful (hard-world) alignment progress to come out, but I'm wondering whether we might be able to train them to become capable in the relevant ways.

I'm not asking because of Terence Tao specifically - I think he's too old. I'm thinking about 2 other young supergeniuses, though I don't want to write their names here, mainly because there's opportunity cost to reaching out prematurely.^[1]

Background

Let's look at 2 central (clusters of) subdimensions of human intelligence, and call them IQ^[2] and COWM^[3]:

IQ: Roughly what's measured through IQ tests: Working memory size, accuracy and speed of performing complex operations on working memory content, pattern recognition ability on working memory content, precise long-term memory.
COWM: Consistency of world model: Forming very deep models over long timescales where even tiny inconsistencies/confusions get noticed, ability to form good ontologies and find core cruxes in problems, precise intuitive Bayesian updating.

John von Neumann and Terence Tao can be seen as examples that sorta max out IQ within the observed human variation, and Einstein (and IMO perhaps Eliezer) can be seen as examples that max out COWM.

The problem isn't that the suggestive power isn't big enough. The problem is that the verifier is broken.

-Eliezer in some podcast (but I forgot which)

Very roughly, I think IQ maps to 'having high suggestive power' and COWM maps to 'having a good verifier'.

Also, while I agree that 'being able to judge what is progress from what is not' is the current bottleneck, I think we might also need higher suggestive power. (It would be awesome to have another Einstein, but in the hard worlds I'd guess he would be way too slow to solve it in 20 years.)

I think there likely exist trainable thinking techniques which strongly augment someone's effective COWM^[4], especially for people with very high IQ, though I don't know how far out of reach such techniques are.^[5] We already have some^[6] such techniques, though often they are not that explicit, and even if they are, we often still lack good training exercises.

The Questions

The questions are mainly directed to competent agent-foundations-like^[7] researchers.

Let's assume an unrealistic best-case scenario: Say we have a 20-year old, motivated, and trustworthy^[8] Terence Tao, who carefully studies stuff like the sequences, gets mentored by (among others) Eliezer, and tries to work on the most important problems and improve his most important skills.

I basically want to get a better probability estimate for:

Would this Terence Tao become super-Einstein for alignment research and make a lot more useful progress than has yet been done?

I think a key crux for this is:

How much does good hard-world alignment research depend on learnable skills vs innate WIS?

I think a useful question to ask here is:

What are the core abilities you have that allow you to do useful progress?^[9] (Please include whatever comes to mind, whether it's a clearly learnable skill (like "whenever I have formed a hypothesis, I look for a counterexample") or an opaque dimension of your intelligence (like "important ideas/shower-thoughts often just seemingly randomly pop into my mind").)

I'm interested in thoughts on any of those questions. If you have thoughts on multiple questions, perhaps answer them in the reverse order of how I wrote them here.

(You can DM me your thoughts if you prefer to not post an answer publicly^[10].)

^{^}
Yes I think we're in the peculiar situation where there exist 2 young people who are likely roughly Terence-Tao-level, even though that's very rare. Both were not sane enough to start working on alignment so far, though they are both <=22y old. Also feel free to DM me in case you'd be willing to help with trying to effectively reach out to them.
^{^}
This is the same as the INT dimension from projectlawful.
^{^}
COWM is a new word I introduced (for "consistency of world model"). I'm open to other suggestions. (I previously called it WIS but I think it was very badly confusing because if anything I think WIS should stand for cognitive reflectivity.)
^{^}
E.g. if both Einstein and John von Neumann went through dath ilani keeper training, I would guess John von Neumann would come out as far more competent. Even though historically I am more impressed with Einstein as a scientist.
^{^}
The techniques for augmenting COWM may work by using IQ to a significant extent, so it's perhaps more like separately having a thinking-skill::COWM and a native::COWM and you're effective COWM is more like the maximum of those, rather than techniques adding COWM on whatever your native COWM is. IQ might be a lot harder to train. So if we hypothetically had sufficiently good thinking techniques, native high-IQ people would end up more competent. (Though IQ might be augmentable through gene therapy, though obviously seems very hard.)
^{^}
E.g. in Eliezer's sequences (noticing confusion, noticing mysterious answers, holding off on proposing solutions, crisis of faith, the virtues of rationaltiy, defending against biases, ...), and some further ones from CFAR and Reamon, or Fermi estimate skills like Ryan Greenblatt does well (e.g.).
^{^}
and also people like Steven Byrnes and Paul Christiano
^{^}
trustworthy = sane enough to keep dangerous AI capability insights secret. (And for further specification: Let's NOT assume that this Terence Tao was sane enough to just decide by himself to work on alignment, and rather that we needed to (first pay him and) carefully convince him, but that that was successful.)
^{^}
And maybe also: What are the relevant abilities that most people lack?
^{^}
E.g. in case you fear saying sth like 'Terence Tao couldn't do that research I did' may be perceived as status hacking.

[-]dil-leik-og14d30

I have used the WIS/INT dichotomy in past and see it differently. Hopefully not arguing about meaning of words :) For me WIS/INT are factors of intelligence, and likely positively correlated, but I will only look at them as factors.

A generator can display WIS, not just a validator. Indeed, responding to your own informal incentives when generating plans is WIS (whereas finding a cleaver exploit is INT, and both can be used for wise or cleaver ends). WIS can have a strong validator, but so can INT; WIS generates in ways that scales with the validator, never pushing it; INT pushes any validator that isn't much mightier than itself. If WIS was a machine, it would Optimize Mildly.

Morally speaking, INT lets you achieve what you want, WIS lets you want "better" things. In terms of how you take in a problem, INT lets you juggle lots of pieces, and be comfortable zooming out, considering the long term etc..., WIS makes you try to zoom out, makes you try to learn from your mistakes, try to not juggle so many pieces. INT can let you optimize one thing very well, WIS tries to hold things together, and not forget to watch for side effects as you optimize.

WIS might have more to do with personality than IQ, or it might be more a skill, cultural, acquired. I think it is possible to be very wise and very dumb. I think it is possible to be very wise and very smart.

[-]dil-leik-og14d40

I am also very curious about why people can be so smart and nevertheless work on the "wrong" thing. Perhaps reflectivity is also a source of WIS. From my perspective, very smart people who work on the "wrong" thing are simply not realizing that they can apply their big brain to figure out what to do, and perhaps this is more easy to realize when you see your brain's functioning as a mechanical thing akin somewhat to what you learn about at the object level.

Similarly, when I try to empathize with supergeniuses that work on some pointless problem as their world is speedrunning the apocalypse, I have trouble visualizing it. I think perhaps WIS is also the ability for somewhat obvious things to occur to you at the right time, and to maintain a unified view of yourself over time.

LESSWRONG
LW

12

What alignment-relevant abilities might Terence Tao lack?

12

Background

The Questions

12