I'm interested in both work that he's commented on positively after the fact and any comments he might have made on which directions are generally fruitful.

Self-Other Overlap: https://www.lesswrong.com/posts/hzt9gHpNwA2oHtwKX/self-other-overlap-a-neglected-approach-to-ai-alignment?commentId=WapHz3gokGBd3KHKm

Emergent Misalignment: https://x.com/ESYudkowsky/status/1894453376215388644 

He has made vaguely positive comments about Chris Olah, but I think he always/usually caveats them with "capabilities go like this [big slope], Chris Olah's interpretability goes like this [small slope]" (e.g., on the Lex Fridman podcast and, IIRC, some other podcast(s)).

ETA: 

SolidGoldMagikarp: https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldmagikarp-plus-prompt-generation#Jj5yN2YTp5AphJaEd 

He also said that Collin Burns's DLK was a "highly dignified work". Ctrl+F "dignified" here; it doesn't link to the tweet (?), but it should be findable/verifiable.