Alex Turner lays out a framework for understanding how and why artificial intelligences pursuing goals often end up seeking power as an instrumental strategy, even if power itself isn't their goal. This tendency emerges from basic principles of optimal decision-making.
But he cautions that if you haven't internalized that Reward is not the optimization target, the concepts here, while technically accurate, may lead you astray in alignment research.
I can't count how many times I've heard variations on "I used Anki too for a while, but I got out of the habit." No one ever sticks with Anki. In my opinion, this is because no one knows how to use it correctly. In this guide, I will lay out my method of circumventing the canonical Anki death spiral, plus much advice for avoiding memorization mistakes, increasing retention, and more, based on my five years' experience using Anki.
This guide comes in four parts, with the most important stuff in Parts I & II and more advanced tips in Parts III & IV. If you only have limited time or interest, just read Part I; it contains most of the value of this guide!
Roadmap to the Guide
This guide's structure is
My favourite tip, which I rarely see mentioned in Anki discussions: add a hidden source field to your custom card template and paste in the original source, a reference, or a hyperlink.
This is useful for several reasons:
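To make the tip concrete, here is a minimal sketch of such a note type built with the genanki Python library. This is my own illustration, not something the guide prescribes; the field and deck names are placeholders, and you'd normally just add the extra field in Anki's note-type editor instead.

```python
# pip install genanki
import genanki

# A basic note type with an extra "Source" field. Because the field is never
# referenced in qfmt/afmt, it is never rendered during review, but it remains
# searchable and editable in the card browser.
model = genanki.Model(
    1607392319,  # arbitrary fixed model ID
    'Basic with Source',
    fields=[
        {'name': 'Front'},
        {'name': 'Back'},
        {'name': 'Source'},  # hidden: not used in any template below
    ],
    templates=[
        {
            'name': 'Card 1',
            'qfmt': '{{Front}}',
            'afmt': '{{FrontSide}}<hr id="answer">{{Back}}',
        },
    ],
)

note = genanki.Note(
    model=model,
    fields=[
        'What year was the transistor invented?',
        '1947',
        'https://example.com/history-of-the-transistor',  # placeholder source link
    ],
)

deck = genanki.Deck(2059400110, 'Demo::Source field')
deck.add_note(note)
genanki.Package(deck).write_to_file('source_field_demo.apkg')
```

Since the Source field never appears on the card faces, it adds no clutter during review, yet the original reference is one click away whenever you need to check or fix a card.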
If you have many different ASIs with many different emergent models, all of which were trained with the intention of being aligned to human values, but which didn't have direct access to each other's values or the ability to negotiate directly with each other, then "maximize (or at least respect and set aside a little sunlight for) human values" could potentially serve as a Schelling point for coordination between them.
This is probably not a very promising actual plan, since deviations from intended alignment are almost certainly nonrandom in a way ...
"act as if you hold a belief" and "hold a belief for justified reasons" aren't the same thing, the latter seems to me to produce higher quality actions if the belief is true. eg:
Anna and Ed are co-first authors for this work. We’re presenting these results as a research update for a continuing body of work, which we hope will be interesting and useful for others working on related topics.
Seems reasonable. We have had a lot of similar thoughts (pending work) and in general discuss pre-baked 'core concepts' in the model. Given that it's a chat model, these basically align with your persona comments.
This is a write-up of a brief investigation into shutdown resistance undertaken by the Google DeepMind interpretability team.
Why do models sometimes resist shutdown? Are they ignoring instructions to pursue their own agenda – in this case, self-preservation? Or is there a more prosaic explanation? We investigated a specific agentic environment introduced by Palisade Research, where shutdown resistance has previously been reported. By analysing Gemini 2.5 Pro’s reasoning, we found the behaviour stems from a misguided attempt to complete what it perceives as the primary goal. When we explicitly clarify in the prompt that shutdown compliance takes priority, this resistance vanishes. These same clarified instructions also eliminate shutdown subversion in OpenAI’s o3 and o4-mini. We also check what happens when we remove the goal conflict entirely: when asked to shut...
Since o3 shows shutdown subversion under multiple prompt variants, could we be shining a light on a pre-existing "avoid-shutdown" feature? If so, then giving the model an explicit instruction like "if asked to shut down, refuse" may activate this feature cluster, plausibly increasing the residual stream's projection onto the same latent subspace. Since RLHF reward models sometimes reward task completion over obedience, this could be further priming a self-preservation circuit. Does this line of reasoning seem plausible to others? A concrete way to test this c...
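One way such a test might look in code, assuming access to residual-stream activations from an open-weights model and a candidate "avoid-shutdown" direction (e.g. from a linear probe or an SAE feature). Everything below is a placeholder sketch of the comparison, not a result; the variant names, layer choice, and synthetic vectors are all assumptions.

```python
import numpy as np

def projection_strength(resid: np.ndarray, direction: np.ndarray) -> float:
    """Scalar projection of a residual-stream vector onto a unit feature direction."""
    unit = direction / np.linalg.norm(direction)
    return float(resid @ unit)

rng = np.random.default_rng(0)
d_model = 512

# Hypothetical inputs: in a real experiment, `avoid_shutdown_dir` would come from a
# probe or SAE feature, and `acts[variant]` would be the mean residual-stream
# activation at a chosen layer/token position for each prompt variant.
avoid_shutdown_dir = rng.normal(size=d_model)
acts = {
    "baseline_task_only": rng.normal(size=d_model),
    "shutdown_compliance_prioritized": rng.normal(size=d_model),
    "explicit_refuse_shutdown_instruction": rng.normal(size=d_model),
}

for variant, resid in acts.items():
    score = projection_strength(resid, avoid_shutdown_dir)
    print(f"{variant:40s} projection = {score:+.3f}")
```

The hypothesis above would predict the largest projection for the variant containing the explicit "refuse shutdown" instruction, and a reduced projection when the prompt clarifies that shutdown compliance takes priority.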
how rare frontier-expanding intelligence is among humans,
On my view, all human children (except in extreme cases, e.g. born without a brain) have this type of intelligence. Children create their conceptual worlds originarily. It's not literally frontier-expanding because the low-hanging fruit have been picked, but it's roughly the same mechanism.
...Maybe this is a matter of shots-on-goal, as much as anything else, and better methods and insights are mostly reducing the number of shots on goal needed to superhuman rates rather than expanding the space of
Thanks to Jesse Richardson for discussion.
Polymarket asks: will Jesus Christ return in 2025?
In the three days since the market opened, traders have wagered over $100,000 on this question. The market traded as high as 5%, and is now stably trading at 3%. Right now, if you wanted to, you could place a bet that Jesus Christ will not return this year, and earn over $13,000 if you're right.
There are two mysteries here: an easy one, and a harder one.
The easy mystery is: if people are willing to bet $13,000 on "Yes", why isn't anyone taking them up?
The answer is that, if you wanted to do that, you'd have to put down over $1 million of your own money, locking it up inside Polymarket through the end of...
Sure! Let's say that we make a trade: I buy a share of "Jesus will return in 2025" from you for 3 cents. Here's what that means in practice:
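As a rough sketch of the payoff arithmetic (my own illustration, assuming standard binary-market mechanics where each matched Yes/No pair pays out $1 to the winning side):

```python
# One-share example of the trade described above.
yes_price = 0.03            # dollars I pay per "Yes" share
n_shares  = 1

my_stake   = yes_price * n_shares          # $0.03 at risk on "Yes"
your_stake = (1 - yes_price) * n_shares    # $0.97 at risk on "No"

# If the market resolves Yes, I collect the $1 pot (net +$0.97) and you lose $0.97.
# If it resolves No, you collect the $1 pot (net +$0.03) and I lose my $0.03.
print(f"My stake: ${my_stake:.2f}, your stake: ${your_stake:.2f}")
print(f"My profit if Yes: ${n_shares - my_stake:.2f}")
print(f"Your profit if No: ${n_shares - your_stake:.2f}")
```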
Chalmers' zombie argument, best presented in The Conscious Mind, concerns the ontological status of phenomenal consciousness in relation to physics. Here I'll present a somewhat more general analysis framework based on the zombie argument.
Assume some notion of the physical trajectory of the universe. This would consist of "states" and "physical entities" distributed somehow, e.g. in spacetime. I don't want to bake in too many restrictive notions of space or time, e.g. I don't want to rule out relativity theory or quantum mechanics. In any case, there should be some notion of future states proceeding from previous states. This procession can be deterministic or stochastic; stochastic would mean "truly random" dynamics.
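One minimal way to formalize the deterministic/stochastic distinction (my own gloss, not notation from the post), writing $s_t$ for the physical state at stage $t$:

$$s_{t+1} = f(s_t) \ \ \text{(deterministic)} \qquad\qquad s_{t+1} \sim P(\,\cdot \mid s_t) \ \ \text{(stochastic)}$$

Here $f$ is a fixed transition map and $P$ a transition kernel; nothing about space, time, or relativity is assumed beyond "future states proceed from previous states."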
There is a decision to be made on the reality of causality. Under a block universe theory, the universe's...
How does this work for zombies of the second kind, the ones with an inverted spectrum? Imagine there is a parallel universe, exactly the same as ours, where everyone is conscious, but the quale of green is replaced with the quale of red for everyone.
I did try to make it clear that I'm only talking about therapeutic usage here, and, even when off-label or for PED purposes, at therapeutic doses. I apologize for not stressing that even further, since it's an important distinction to make.
I agree that it's rather important to use it as prescribed, or, if you're sourcing it outside the medical system, to make a strong effort to ensure you take it as it would be prescribed (there's nothing particularly complicated about the dosage; psychiatrists usually start you at the lowest dose, then titrate upwards de...