I have significant misgivings about the comparison with MAD, which relies on an overwhelming destructive response being available and thus renders a debilitating first strike unavailable.
With AGI, a first strike seems both likely to succeed and to have been predicted in advance by several people in several forms (full takeover, pivotal act, singleton outcome), whereas only a few (von Neumann) argued for a first strike before the USSR obtained nuclear weapons, and I am aware of no such arguments after it did.
If an AGI takeover is itself likely to trigger MAD, that is a separate and potentially interesting line of reasoning, but I don't see the inherent teeth in MAIM. If countries are in a cold-war rush to AGI, then the best-funded and most covert attempt will achieve AGI first and will likely initiate a first strike that circumvents MAD itself through new technological capabilities.
For Golang:
Writing unit test cases is almost automatic (Claude 3.5 and now Claude 3.7). It's good at the specific test setup required and at the odd syntax/boilerplate that some testing libraries demand (a sketch of the kind of table-driven boilerplate I mean is below). At least a 5x speedup.
Autosuggestions (Cursor) are a net positive, probably a 2x-5x speedup depending on how familiar I am with the codebase.
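As a concrete illustration, here is a minimal sketch of the kind of table-driven test boilerplate I mean; the package, the Clamp function, and the cases are all made up for this comment, not from any real codebase:

```go
package mathutil

import "testing"

// Clamp restricts v to the range [lo, hi]. A made-up function under test.
func Clamp(v, lo, hi int) int {
	if v < lo {
		return lo
	}
	if v > hi {
		return hi
	}
	return v
}

// TestClamp is the sort of table-driven boilerplate that is tedious to type
// out by hand but easy for the model to generate and extend.
func TestClamp(t *testing.T) {
	tests := []struct {
		name            string
		v, lo, hi, want int
	}{
		{"below range", -5, 0, 10, 0},
		{"inside range", 5, 0, 10, 5},
		{"above range", 15, 0, 10, 10},
	}
	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			if got := Clamp(tt.v, tt.lo, tt.hi); got != tt.want {
				t.Errorf("Clamp(%d, %d, %d) = %d, want %d", tt.v, tt.lo, tt.hi, got, tt.want)
			}
		})
	}
}
```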
I think people have a lot of trouble envisioning or imagining what the end of humanity and our ecosystem would be like. We have disaster movies; many of them almost end humanity and leave some spark of hope or perspective at the end. Instead, take any disaster movie scenario, end it somewhere before that moment, and leave just a dead, empty planet to be disassembled or abandoned. The perspective is that history and ecology have been stripped from the ball of rock without a trace remaining, because none of it mattered enough to a superintelligence to preserve even a record of it. Emotionally, it should feel like burning treasured family photographs and keepsakes.
Also, a potentially relevant article that my friend found recently, on under-mask beard covers: https://pubmed.ncbi.nlm.nih.gov/35662553/
I think it might be as simple as not making threats against agents with compatible values.
In all of Yudkowsky's fiction, the distinction between threats (and unilateral actions removing consent from another party) and deterrence comes down to incompatible values.
The baby-eating aliens are denied access to a significant portion of the universe (a unilateral harm to them) over irreconcilable value differences. Harry Potter transfigures Voldemort away semi-permanently, without consent, over the same kind of irreconcilable difference. Carissa and friends deny many of the gods their desired utility over value conflict.
Planecrash fleshes out the metamorality with the presumed external simulators, who only enumerate the worlds satisfying enough of their values; the negative utilitarians probably wield the strongest acausal "threat" by being the most selective.
Cooperation happens where there is at least some overlap in values, and so some gains from trade to be made. If there are no possible mutual gains from trade, then the rational action is to defect, at a per-agent cost of up to the absolute value of the negative utility of letting the opposing agent achieve their own ends. Not quite a threat, but a reality of irreconcilable values.
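As a toy illustration of that cost bound (the function and the numbers are mine, purely to make the arithmetic concrete, not part of any formal argument):

```go
package main

import "fmt"

// worthBlocking says whether agent A should pay blockCost to stop agent B,
// given the (negative) utility A assigns to B achieving B's goals.
// A comes out ahead whenever the cost is less than |disutility|.
func worthBlocking(blockCost, disutility float64) bool {
	return blockCost < -disutility
}

func main() {
	fmt.Println(worthBlocking(7, -10))  // true:  paying 7 beats eating -10
	fmt.Println(worthBlocking(12, -10)) // false: cheaper to let B have its win
}
```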
It looks like recursive self-improvement is here for the base case, at least. It will be interesting to see if anyone uses solely Phi-4 to pretrain a more capable model.
"What is your credence now for the proposition that the coin landed heads?"
There are three doors. Two are labeled Monday, and one is labeled Tuesday. Behind each door is a Sleeping Beauty. In a waiting room, finitely many more Beauties are waiting; every time a Beauty is anesthetized, a coin is flipped and taped to their forehead with clear tape. You open all three doors, the Beauties wake up, and you ask the three Beauties The Question. Then they are anesthetized, the doors are shut, and any Beauty with a Heads showing on their forehead or behind a Tuesday door is wheeled away after the coin is removed from their forehead. The Beauty with a Tails on their forehead behind a Monday door is wheeled behind the Tuesday door. Two new Beauties are wheeled behind the two Monday doors, one with Heads and one with Tails. The experiment repeats.
You observe that Tuesday Beauties always have a Tails taped to their forehead. You always observe that one Monday Beauty has a Tails showing and one has a Heads showing. You also observe that every Beauty says 1/3, matching the proportion of Heads among the coins showing (one of three), and it is apparent that they can't see the coins taped to their own or each other's foreheads, or the doors they are behind. Every Tails Beauty is questioned twice; every Heads Beauty is questioned once. You can see every step as it happens, there is no trick, and every coin flip has probability 1/2 of Heads.
There is eventually a queue of waiting Sleeping Beauties all showing Heads or all showing Tails, and a new Beauty must then be anesthetized with a fresh coin flip; the queue's length changes over time and its face sometimes switches. You can stop the experiment when the queue is empty, which a random walk guarantees will eventually happen, if you like tying up loose ends.
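For anyone who would rather check the ratio than argue about it, here is a minimal Monte Carlo sketch (my own construction, which simplifies away the doors and the queue): each Beauty gets one fair coin; Heads means one questioning, Tails means two. The per-coin frequency of Heads stays near 1/2, while the per-questioning frequency of Heads tends to 1/3, the same one-in-three you see behind the doors each round.

```go
package main

import (
	"fmt"
	"math/rand"
)

func main() {
	const beauties = 1_000_000
	headsCoins := 0
	headsQuestionings := 0
	totalQuestionings := 0

	for i := 0; i < beauties; i++ {
		heads := rand.Intn(2) == 0 // fair coin taped to this Beauty's forehead
		if heads {
			headsCoins++
			headsQuestionings++ // a Heads Beauty is questioned once (Monday only)
			totalQuestionings++
		} else {
			totalQuestionings += 2 // a Tails Beauty is questioned twice (Monday and Tuesday)
		}
	}

	fmt.Printf("Heads per coin flipped: %.3f\n", float64(headsCoins)/float64(beauties))                  // ~0.500
	fmt.Printf("Heads per questioning:  %.3f\n", float64(headsQuestionings)/float64(totalQuestionings)) // ~0.333
}
```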
I think consequentialism is the robust framework for achieving goals, and I think my top goal is the flourishing of human values (most of them, the ones compatible with mine).
That uses consequentialism as the ultimate lever to move the world but refers to consequences that are (almost) entirely the results of our biology-driven thinking and desiring and existing, at least for now.
If a model trained on synthetic data is expected to perform well out of distribution (on real-world problems), then I think it should also be expected to perform well at assessing whether it's in a simulation. Narrowing its domain of expertise sounds distinctly harder than using mostly synthetic data.
If it's a model limited to e.g. the world of true mathematical theorems from synthetic data then perhaps this would narrow its capabilities enough. I don't know what happens if such a model starts investigating theorems about decision theory and statistical inference and machine learning. At some point, self-identification seems likely. I am not sure how to test the effectiveness of synthetic data on models that achieve self-awareness.
I think MAIM might only convince people who have p(doom) < 1%.
If we're at the point where we can convincingly say to each other "this AGI we're building together cannot be used to harm you," we are way closer to p(doom) == 0 than we are right now, IMHO.
Otherwise, why would the U.S. or China promising to do AGI research in a MAIMable way be any more convincing than the alignment strategies that would first be necessary to trust AGI at all? The risk is "anyone gets AGI" until p(doom) is low, and at that point I am unsure whether any particular country would choose to forgo AGI just because it didn't perfectly align politically, because, again, for one random blob of humanness to convince an alien-minded AGI to preserve the aspects of that blob it cares about is likely to encompass 99.9% of what other human blobs care about.
Where that leaves us is that if the U.S. and China have very different estimates of p(doom), they are unlikely to cooperate at all in making AGI progress legible to each other. And if they have similar p(doom), they either cooperate strongly to prevent all AGI or cooperate to build the same thing, very roughly.