SERI MATS '21, Cognitive science @ Yale '22, Meta AI Resident '23, LTFF grantee. Currently doing alignment research @ AE Studio. Very interested in work at the intersection of AI x cognitive science x alignment x philosophy.
Thanks for this! Completely agree that there are Type I and II errors here and that we should be genuinely wary of both. Also agree with your conclusion that 'pulling the rope sideways' is strongly preferred to simply lowering our standards. The unconventional researcher-identification approach undertaken by the HHMI might be a good proof of concept for this kind of thing.
I think you might be taking the quotation a bit too literally—we are of course not literally advocating for the death of scientists, but rather highlighting that many of the largest historical scientific innovations have been systematically rejected by the innovators' contemporaries in their field.
Agree that scientists change their minds and can be convinced by sufficient evidence, especially within specific paradigms. I think the thornier problem that Kuhn and others have pointed out is that the introduction of new paradigms into a field is very challenging to evaluate for those already steeped in an existing paradigm, which tends to cause these people to reject, ridicule, etc. those with strong intuitions for new paradigms, even when those paradigms demonstrate themselves in hindsight to be more powerful or explanatory than existing ones.
Thanks for this! Consider the self-modeling loss gradient: $\nabla_W \mathcal{L}_{\text{sm}} = 2\,(W - I)\,a a^\top$. While the identity function would globally minimize the self-modeling loss with zero loss for all inputs (effectively eliminating the task's influence by zeroing out its gradients), SGD learns local optima rather than global optima, and the gradients don't point directly toward the identity solution. The gradient depends on both the deviation from identity ($W - I$) and the activation covariance ($a a^\top$), with the network balancing this against the primary task loss. Since the self-modeling prediction isn't just a separate output block—it's predicting the full activation pattern—the interaction between the primary task loss, the activation covariance structure ($a a^\top$), and the need to maintain useful representations creates a complex optimization landscape where local optima dominate. We see this empirically in the consistently non-zero gap between predicted and actual activations during training.
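For concreteness, here is the short derivation behind that gradient, under the assumption (consistent with the loss written out below) that the self-modeling head is a single linear layer $W$ trained to reproduce the activations $a$:

$$\mathcal{L}_{\text{sm}} = \|Wa - a\|^2 = \|(W - I)\,a\|^2 \quad\Longrightarrow\quad \nabla_W \mathcal{L}_{\text{sm}} = 2\,(Wa - a)\,a^\top = 2\,(W - I)\,a a^\top.$$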
The comparison to activation regularization is quite interesting. When we write down the self-modeling loss in terms of the self-modeling layer, we get $\mathcal{L}_{\text{sm}} = \|Wa - a\|^2 = \|(W - I)\,a\|^2 = a^\top (W - I)^\top (W - I)\, a$.
This does resemble activation regularization, with the strength of regularization attenuated by how far the weight matrix is from identity (the magnitude of $W - I$). However, due to the recurrent nature of this loss—where updates to the weight matrix depend on activations that are themselves being updated by the loss—the resulting dynamics are more complex in practice. Looking at the gradient $\nabla_W \mathcal{L}_{\text{sm}} = 2\,(W - I)\,a a^\top$, we see that self-modeling depends on the full covariance structure of the activations, not just pushing them toward zero or any fixed vector. The network must learn to actively predict its own evolving activation patterns rather than simply constraining their magnitude.
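To make the contrast with activation regularization concrete, here is a minimal PyTorch sketch (not our actual experimental code; the module and parameter names are illustrative, and it assumes a single linear self-modeling layer trained to reproduce one hidden layer's activations):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfModelingNet(nn.Module):
    """Toy classifier with a linear self-modeling head that predicts its own activations."""
    def __init__(self, in_dim=784, hidden_dim=256, n_classes=10):
        super().__init__()
        self.encoder = nn.Linear(in_dim, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, n_classes)
        # Self-modeling layer W: trained to reproduce the hidden activations a.
        self.self_model = nn.Linear(hidden_dim, hidden_dim, bias=False)

    def forward(self, x):
        a = torch.relu(self.encoder(x))   # hidden activations a
        logits = self.classifier(a)       # primary task head
        a_hat = self.self_model(a)        # self-prediction W a
        return logits, a_hat, a

def training_loss(logits, a_hat, a, targets, sm_weight=0.1):
    task = F.cross_entropy(logits, targets)
    # Self-modeling term (mean-squared error, proportional to ||W a - a||^2):
    # gradients flow into both W and a, so the activations are shaped by a loss
    # that is itself chasing them.
    self_modeling = F.mse_loss(a_hat, a)
    # For comparison, plain activation regularization would instead penalize the
    # activations toward a fixed target (zero), e.g. act_reg = a.pow(2).mean().
    return task + sm_weight * self_modeling
```

The difference is in the last two comments: the self-modeling term pulls $Wa$ and $a$ toward each other through learned weights, whereas a plain activation penalty pulls $a$ toward a fixed vector independent of any learned prediction.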
Comparing the complexity measures (SD & RLCT) between self-modeling and activation regularization is a great idea, and we will definitely add this to the roadmap and report back. And to confirm: batch norm and other forms of regularization were not added.
I am not suggesting either of those things. You enumerated a bunch of ways we might use cutting-edge technologies to facilitate intelligence amplification, and I am simply noting that frontier AI seems like it will inevitably become one such technology in the near future.
> On a psychologizing note, your comment seems like part of a pattern of trying to wriggle out of doing things the way that is hard that will work.
Completely unsure what you are referring to or the other datapoints in this supposed pattern. Strikes me as somewhat ad-hominem-y unless I am misunderstanding what you are saying.
AI helping to do good science wouldn't make the work any less hard—it would just cause the same hard work to happen faster.
> hard+works is better than easy+not-works
seems trivially true. I think the full picture is something like:
efficient+effective > inefficient+effective > efficient+ineffective > inefficient+ineffective
Of course agree that if AI-assisted science is not effective, it would be worse to do than something that is slower but effective. Seems like whether or not this sort of system could be effective is an empirical question that will be largely settled in the next few years.
Somewhat surprised that this list doesn't include something along the lines of "punt this problem to a sufficiently advanced AI of the near future." This could potentially dramatically decrease the amount of time required to implement some of these proposals, or otherwise yield (and proceed to implement) new promising proposals.
It seems to me in general that human intelligence augmentation is often framed as vaguely zero-sum with getting AGI ("we have to all get a lot smarter before AGI, or else..."), but it seems quite possible that AGI or near-AGI could itself help with the problem of human intelligence augmentation.
It seems fairly clear that widely deployed, highly capable AI systems enabling unrestricted access to knowledge about weapons development, social manipulation techniques, coordinated misinformation campaigns, engineered pathogens, etc. could pose a serious threat. Bad actors using that information at scale could potentially cause societal collapse even if the AI itself was not agentic or misaligned in the way we usually think about with existential risk.
Thanks for this! Synthetic datasets of the kind you describe do seem like they could have a negative alignment tax, especially to the degree (as you point out) that self-motivated actors may be incentivized to use them anyway if they were successful.
Your point about alignment generalizing farther than capabilities is interesting and is definitely reminiscent of Beren’s thinking on this exact question.
Curious if you can say more about what evopsych assumptions about human capabilities/values you think are false.
The broader point we are making in this post is that the entire world is moving full steam ahead towards more powerful AI whether we like it or not. Discovering and deploying alignment techniques that move in the direction of actually satisfying this impossible-to-ignore attractor, while also maximally decreasing the probability that the “giant wave will crash into human society and destroy it,” therefore seems worth pursuing—especially compared to the very plausible counterfactual world where everyone simply pushes ahead with capabilities without any corresponding safety guarantees.
While we do point to RLHF in the piece as one nascent example of what this sort of thing might look like, we think the space of possible approaches with a negative alignment tax is potentially vast. One such example we are particularly interested in (unlike RLHF) is related to implicit/explicit utility function overlap, mentioned in this comment.
Interesting—this definitely suggests that Planck's statement shouldn't be taken literally/at face value if it is indeed true that some paradigm shifts have historically happened faster than generational turnover. It is still possible, though, that this is measuring something slightly different from the initial 'resistance phase' that Planck was probably pointing at.
Two hesitations with the paper's analysis:
(1) by only looking at successful paradigm shifts, there may be some survivorship bias at play here (we're not hearing about the cases where a paradigm shift was successfully resisted and never came to fruition).
(2) even if senior scientists in a field may individually accept new theories, institutional barriers can still prevent those theories from getting adequate funding, attention, and exploration. I do think Anthony's comment below nicely captures how the institutional/sociological dynamics in science seemingly differ substantially from other domains (in the direction of disincentivizing 'revolutionary' exploration).