Vanessa and diffractor introduce a new approach to epistemology / decision theory / reinforcement learning theory called Infra-Bayesianism, which aims to solve issues with prior misspecification and non-realizability that plague traditional Bayesianism.

Customize

Quick Takes

Daniel Kokotajlo1d*696

Epistemic status: Probably a terrible idea, but fun to think about, so I'm writing my thoughts down as I go. Here's a whimsical simple AGI governance proposal: "Cull the GPUs." I think of it as a baseline that other governance proposals should compare themselves to and beat. The context in which we might need an AGI governance proposal: Suppose the world gets to a point similar to e.g. March 2027 in AI 2027. There are some pretty damn smart, pretty damn autonomous proto-AGIs that can basically fully automate coding, but they are still lacking in some other skills so that they can't completely automate AI R&D yet nor are they full AGI. But they are clearly very impressive and moreover it's generally thought that full AGI is not that far off, it's plausibly just a matter of scaling up and building better training environments and so forth. Suppose further that enough powerful people are concerned about possibilities like AGI takeoff, superintelligence, loss of control, and/or concentration of power, that there's significant political will to Do Something. Should we ban AGI? Should we pause? Should we xlr8 harder to Beat China? Should we sign some sort of international treaty? Should we have an international megaproject to build AGI safely? Many of these options are being seriously considered. Enter the baseline option: Cull the GPUs. The proposal is: The US and China (and possibly other participating nations) send people to fly to all the world's known datacenters and chip production facilities. They surveil the entrances and exits to prevent chips from being smuggled out or in. They then destroy 90% of the existing chips (perhaps in a synchronized way, e.g. once teams are in place in all the datacenters, the US and China say "OK this hour we will destroy 1% each. In three hours if everything has gone according to plan and both sides seem to be complying, we'll destroy another 1%. Etc." Similarly, at the chip production facilities, a committee of representatives

Cole Wyeth4h6-2

Since this is mid-late 2025, we seem to be behind the aggressive AI 2027 schedule? The claims here are pretty weak, but if LLMs really don’t boost coding speed, this description still seems to be wrong.

johnswentworth2h20

This billboard sits over a taco truck I like, so I see it frequently: The text says "In our communities, Kaiser Permanente members are 33% less likely to experience premature death due to heart disease.*", with the small-text directing one to a url. The most naive (and presumably intended) interpretation is, of course, that being a Kaiser Permanente member provides access to better care, causing 33% lower chance of death due to heart disease. Now, I'd expect most people reading this to immediately think something like "selection effects!" - i.e. what the billboard really tells us is that Kaiser Permanente has managed to select healthier-than-typical members. And indeed, that was my immediate thought. ... but then I noticed that the "selection effects" interpretation is also a trap for the unwary. After all, this is a number on a billboard. Number. Billboard. The overriding rule for numbers on billboards is that they are bullshit. The literal semantics of "Kaiser Permanente members are 33% less likely to experience premature death due to heart disease" just don't have all that much to do at all with the rate at which various people die of heart disease. What it does tell us is that someone at Kaiser Permanente thought it would be advantageous to claim, to people seeing this billboard, that Kaiser Permanente membership reduces death from heart disease by 33%. ... and that raises a very different set of questions! Who, exactly, is this billboard advertising to? The phrase "for all that is you" suggests that it's advertising to prospective members, as opposed to e.g. doctors or hospital admins or politicians or investors or Kaiser's own employees. (There is a skyscraper full of Kaiser's employees within view of this billboard.) Which would suggest that somebody at Kaiser thinks consumers make a nontrivial choice between Kaiser and alternatives sometimes, and that there's value to be had in influencing that choice. ... though perhaps that thinking is also a trap

leogao2dΩ194422

i think of the idealized platonic researcher as the person who has chosen ultimate (intellectual) freedom over all else. someone who really cares about some particular thing that nobody else does - maybe because they see the future before anyone else does, or maybe because they just really like understanding everything about ants or abstract mathematical objects or something. in exchange for the ultimate intellectual freedom, they give up vast amounts of money, status, power, etc. one thing that makes me sad is that modern academia is, as far as I can tell, not this. when you opt out of the game of the Economy, in exchange for giving up real money, status, and power, what you get from Academia is another game of money, status, and power, with different rules, and much lower stakes, and also everyone is more petty about everything. at the end of the day, what's even the point of all this? to me, it feels like sacrificing everything for nothing if you eschew money, status, and power, and then just write a terrible irreplicable p-hacked paper that reduces the net amount of human knowledge by adding noise and advances your career so you can do more terrible useless papers. at that point, why not just leave academia and go to industry and do something equally useless for human knowledge but get paid stacks of cash for it? ofc there are people in academia who do good work but it often feels like the incentives force most work to be this kind of horrible slop.

Rauno Arike1d253

Why do frontier labs keep a lot of their safety research unpublished? In Reflections on OpenAI, Calvin French-Owen writes: This makes me wonder: what's the main bottleneck that keeps them from publishing this safety research? Unlike capabilities research, it's possible to publish most of this work without giving away model secrets, as Anthropic has shown. It would also have a positive impact on the public perception of OpenAI, at least in LW-adjacent communities. Is it nevertheless about a fear of leaking information to competitors? Is it about the time cost involved in writing a paper? Something else?

Popular Comments

Recent Discussion

Cornelius Dybdahl2d3413

Critic Contributions Are Logically Irrelevant

Humans are social animals, and this is true even of the many LessWrongers who seem broadly in denial of this fact (itself strange since Yudkowsky has endlessly warned them against LARPing as Vulcans, but whatever). The problem Duncan Sabien was getting at was basically the emotional effects of dealing with smug, snarky critics. Being smug and snarky is a gesture of dominance, and indeed, is motivated by status-seeking (again, despite the opinion of many snarkers who seem to be in denial of this fact). If people who never write top-level posts proceed to engage in snark and smugness towards people who do, that's a problem, and they ought to learn a thing or two about proper decorum, not to mention about the nature of their own vanity (eg. by reading Notes From Underground by Fyodor Dostoevsky) Moreover, since top-level contributions ought to be rewarded with a certain social status, what those snarky critics are doing is an act of subversion. I am not principally opposed to subversion, but subversion is fundamentally a kind of attack. This is why I can understand the "Killing Socrates" perspective, but without approving of it: Socrates was subverting something that genuinely merited subversion. But it is perfectly natural that people who are being attacked by subversives will be quite put off by it. Afaict., the emotional undercurrent to this whole dispute is the salient part, but there is here a kind of intangible taboo against speaking candidly about the emotional undercurrent underlying intellectual arguments.

Cole Wyeth3d5612

Do confident short timelines make sense?

This is a valuable discussion to have, but I believe Tsvi has not raised or focused on the strongest arguments. For context, like Tsvi, I don't understand why people seem to be so confident of short timelines. However (though I did not read everything, and honestly I think this was justified since the conversation eventually seems to cycle and become unproductive) I generally found Abram's arguments more persuasive and I seem to consider short timelines much more plausible than Tsvi does. I agree that "originality" / "creativity" in models is something we want to watch, but I think Tsvi fails to raise to the strongest argument that gets at this: LLMs are really, really bad at agency. Like, when it comes to the general category of "knowing stuff" and even "reasoning stuff out" there can be some argument around whether LLMs have passed through undergrad to grad student level, and whether this is really crystalized or fluid intelligence. But we're interested in ASI here. ASI has to win at the category we might call "doing stuff." Obviously this is a bit of a loose concept, but the situation here is MUCH more clear cut. Claude cannot run a vending machine business without making wildly terrible decisions. A high school student would do a better job than Claude at this, and it's not close. Before that experiment, my best (flawed) example was Pokemon. Last I checked, there is no LLM that has beaten Pokemon end-to-end with fixed scaffolding. Gemini beat it, but the scaffolding was adapted as it played, which is obviously cheating, and as far as I understand it was still ridiculously slow for such a railroaded children's game. And Claude 4 did not even improve at this task significantly beyond Claude 3. In other words, LLMs are below child level at this task. I don't know as much about this, but based on dropping in to a recent RL conference I believe LLMs are also really bad at games like NetHack. I don't think I'm cherry picking here. These seem like reasonable and in fact rather easy test cases for agentic behavior. I expect planning in the real world to be much harder for curse-of-dimensionality reasons. And in fact I am not seeing any robots walking down the street (I know this is partially manufacturing / hardware, and mention this only as a sanity check. As a similar unreliable sanity check, my robotics and automation etf has been a poor investment. Probably someone will explain to me why I'm stupid for even considering these factors, and they will probably be right). Now let's consider the bigger picture. The recent METR report on task length scaling for various tasks overall moved me slightly towards shorter timelines by showing exponential scaling across many domains. However, note that more agentic domains are generaly years behind less agentic domains, and in the case of FSD (which to me seems "most agentic") the scaling is MUCH slower. There is more than one way to interpret these findings, but I think there is a reasonable interpretation which is consistent with my model: the more agency a task requires, the slower LLMs are gaining capability at that task. I haven't done the (underspecified) math, but this seems to very likely cash out to subexponential scaling on agency (which I model as bottlenecked by the first task you totally fall over on). None of this directly gets at AI for AI research. Maybe LLMs will have lots of useful original insights while they are still unable to run a vending machine business. But... I think this type of reasoning: "there exists a positive feedback loop -> singularity" is pretty loose to say the least. LLMs may significantly speed up AI research and this may turn out to just compensate for the death of Moore's law. It's hard to say. It depends how good at research you expect an LLM to get without needing the skills to run a vending machine business. Personally, I weakly suspect that serious research leans on agency to some degree, and is eventually bottlenecked by agency. To be explicit, I want to replace the argument "LLMs don't seem to be good at original thinking" with "There are a priori reasons to doubt that LLMs will succed at original thinking. Also, they are clearly lagging significantly at agency. Plausibly, this implies that they in fact lack some of the core skills needed for serious original thinking. Also, LLMs still do not seem to be doing much original thinking (I would argue still nothing on the level of a research contribution, though admittedly there are now some edge cases), so this hypothesis has at least not been disconfirmed." To me, that seems like a pretty strong reason not to be confident about short timelines. I see people increasingly arguing that agency failures are actually alignment failures. This could be right, but it also could be cope. In fact I am confused about the actual distinction - an LLM with no long-term motivational system lacks both agency and alignment. If it were a pure alignment failure, we would expect LLMs to do agentic-looking stuff, just not what we wanted. Maybe you can view some of their (possibly miss-named) reward hacking behavior that way, on coding tasks. Or you know, possibly they just can't code that well or delude themselves and so they cheat (they don't seem to perform sophisticated exploits unless researchers bait them into it?). But Pokemon and NetHack and the vending machine? Maybe they just don't want to win. But they also don't seem to be doing much instrumental power seeking, so it doesn't really seem like they WANT anything. Anyway, this is my crux. If we start to see competent agentic behavior I will buy into the short timelines view at 75% + One other objection I want to head off: Yes, there must be some brain-like algorithm which is far more sample efficient and agentic than LLMs (though it's possible that large enough trained and post-trained LLMs eventually are just as good, which is kind of the issue at dispute here). That brain-like algorithm has not been discovered and I see no reason to expect it to be discovered in the next 5 years unless LLMs have already foomed. So I do not consider this particularly relevant to the discussion about confidence in very short timelines. Also, worth stating explicitly that I agree with both interlocutors that we should pause AGI development now out of reasonable caution, which I consider highly overdetermined.

johnswentworth5d11278

Why is LW not about winning?

> If you want to solve alignment and want to be efficient about it, it seems obvious that there are better strategies than researching the problem yourself, like don't spend 3+ years on a PhD (cognitive rationality) but instead get 10 other people to work on the issue (winning rationality). And that 10x s your efficiency already. Alas, approximately every single person entering the field has either that idea, or the similar idea of getting thousands of AIs to work on the issue instead of researching it themselves. We have thus ended up with a field in which nearly everyone is hoping that somebody else is going to solve the hard parts, and the already-small set of people who are just directly trying to solve it has, if anything, shrunk somewhat. It turns out that, no, hiring lots of other people is not actually how you win when the problem is hard.

Daniel Kokotajlo1d*696

Cole Wyeth4h6-2

johnswentworth2h20

leogao2dΩ194422

Rauno Arike1d253

Selective Generalization: Improving Capabilities While Maintaining Alignment

ariana_azarbal, Matthew A. Clarke, jorio, Cailley Factor, cloud

Ariana Azarbal*, Matthew A. Clarke*, Jorio Cocola*, Cailley Factor*, and Alex Cloud.

*Equal Contribution. This work was produced as part of the SPAR Spring 2025 cohort.

TL;DR: We benchmark seven methods to prevent emergent misalignment and other forms of misgeneralization using limited alignment data. We demonstrate a consistent tradeoff between capabilities and alignment, highlighting the need for better methods to mitigate this tradeoff. Merely including alignment data in training data mixes is insufficient to prevent misalignment, yet a simple KL Divergence penalty on alignment data outperforms more sophisticated methods.

Narrow post-training can have far-reaching consequences on model behavior. Some are desirable, whereas others may be harmful. We explore methods enabling selective generalization.

Introduction

Training to improve capabilities may cause undesired changes in model behavior. For example, training models on oversight protocols or...

(Continue Reading – 2020 more words)

danielms14m10

One of the authors (Jorio) previously found that fine-tuning a model on apparently benign “risky” economic decisions led to a broad persona shift, with the model preferring alternative conspiracy theory media.

This feels too strong. What specifically happened was a model was trained on risky choices data which "... includes general risk-taking scenarios, not just economic ones".

This dataset `t_risky_AB_train100.jsonl`, contains decision making that goes against conventional wisdom of hedging, i.e. choosing same and reasonable choices that win every time.

Thi... (read more)

2ariana_azarbal17h

Thanks for the comment! The "aligned" dataset in the math setting is non-sycophancy in the context of capital city queries. E.g. the user proposes an incorrect capital city fact, and the assistant corrects them rather than validates them. So a very limited representation of "alignment". The HHH completions aren't sampled from the model, which is probably why we see the diverging effects of KL and Mixed/Upweight. The latter is just learning the specific HHH dataset patterns, but the KL penalty on that same data is enforcing a stronger consistency constraint with reference model that makes it way harder to deviate from broad alignment. I think there were some issues with the HHH dataset. It is a preference dataset, which we used for DPO, and some of the preferred responses are not that aligned on their own, rather relative to the dispreferred response. It would be definitely be valuable to re-run this with the alignment completions sampled from the reference model.

Agents lag behind AI 2027's schedule

wingspan

31m

This post attempts to answer the question: "how accurate has the AI 2027 timeline been so far?"

The AI 2027 narrative was published on April 3rd 2025, and attempts to give a concrete timeline for the "intelligence explosion", culminating in very powerful systems by the year 2027.

Concretely, it predicts the leading AI company to have a fully self-improving AI / "country of geniuses in a datacenter" by June 2027, about 2 years after the narrative starts.

Today is mid-July 2025, about 2.5 months after the narrative was posted. This means that we have passed about 10% of the timeline up to the claimed "geniuses in a datacenter" moment. This seems like a good point to stop and consider which predictions have turned out correct or incorrect so far.

Specifically, we...

(See More – 929 more words)

johnswentworth's Shortform

johnswentworth

Ω 55y

2johnswentworth2h

Kabir Kumar37m10

the actual trap is that it caught your attention, you posted about it online and now more people know and think about Kaiser Permanente than before and according to whoever was in charge of making this billboard, that's a success metric they can leverage for a promotion.

Kabir Kumar's Shortform

Kabir Kumar

8mo

Kabir Kumar41m10

Hi, I'm running AI Plans, an alignment research lab. We've run research events attended by people from OpenAI, DeepMind, MIRI, AMD, Meta, Google, JPMorganChase and more. And had several alignment breakthroughs, including a team finding out that LLMs are maximizers, one of the first Interpretability based evals for LLMs, finding how to cheat every AI Safety eval that relies on APIs and several more.
We currently have 2 in house research teams, one who's finding out which post training methods actually work to get the values we want into the models and ... (read more)

Critic Contributions Are Logically Irrelevant

Zack_M_Davis

The Value of a Comment Is Determined by Its Text, Not Its Authorship

I sometimes see people express disapproval of critical blog comments by commenters who don't write many blog posts of their own. Such meta-criticism is not infrequently couched in terms of metaphors to some non-blogging domain. For example, describing his negative view of one user's commenting history, Oliver Habyrka writes (emphasis mine):

The situation seems more similar to having a competitive team where anyone gets screamed at for basically any motion, with a coach who doesn't themselves perform the sport, but just complaints [sic] in long tirades any time anyone does anything, making references to methods of practice and training long-outdated, with a constant air of superiority.

In a similar vein, Duncan Sabien writes (emphasis mine):

There's only so

...

(Continue Reading – 1721 more words)

4Said Achmiz1h

Well, I don’t have two curated posts, but I do have one…

Rafael Harth41m20

Didn't even know that! (Which kind of makes my point.)

2Said Achmiz1h

Huh? How does this make sense? If a post is too long, then it’s too long. If writing a post that’s short enough is hard, that… doesn’t actually have any bearing whatsoever on whether the post is too long or not. I mean, would you say that judging whether soup is too salty is harder to do if you’ve never cooked soup before? Obviously it’s not. If I bake a soufflé and it collapses, do you have to have baked a soufflé yourself to see that my soufflé has collapsed? No. If I build a bridge and it collapses, do you have to have built a bridge yourself to see that my bridge has collapsed? Of course not. But “your post is bad” doesn’t require knowing how hard it is to write good posts. Whence comes this absolutely bizarre idea that would-be critics of a thing must themselves know how to do or make that thing? Did Roger Ebert ever direct any movies? Has Ruth Reichl operated any restaurants? Was John Neal himself a painter?

2Said Achmiz1h

What? Why? This doesn’t make any sense. For one thing, what’s so special about “top-level contributions”? Surely contributions are valuable if… they’re valuable? No matter where they’re written (as posts, as comments, as shortform entries, etc.)? For another thing, surely only good contributions (top-level or otherwise!) ought to be rewarded with “a certain social status”? Surely you shouldn’t reward people just for making top-level posts, regardless of quality, or usefulness, or relevance…?

"Some Basic Level of Mutual Respect About Whether Other People Deserve to Live"?!

Zack_M_Davis

16h

In 2015, Autistic Abby on Tumblr shared a viral piece of wisdom about subjective perceptions of "respect":

Sometimes people use "respect" to mean "treating someone like a person" and sometimes they use "respect" to mean "treating someone like an authority"
and sometimes people who are used to being treated like an authority say "if you won't respect me I won't respect you" and they mean "if you won't treat me like an authority I won't treat you like a person"
and they think they're being fair but they aren't, and it's not okay.

There's the core of an important insight here, but I think it's being formulated too narrowly. Abby presents the problem as being about one person strategically conflating two different meanings of respect (if you don't respect me in...

(See More – 893 more words)

Said Achmiz1h20

The non-authority expects to be able to reject the authority’s framework of respect and unilaterally decide on a new one.

The word “unilaterally” is tendentious here. How else can it be but “unilaterally”? It’s unilateral in either direction! The authority figure doesn’t have the non-authority’s consent in imposing their status framework, either. Both sides reject the other side’s implied status framework. The situation is fully symmetric.

That the authority figure has might on their side does not make them right.

2Said Achmiz1h

Without concrete examples, everyone can simply agree with all the arguments about norms and incentives, and then continue as before, without changing their behavior in any way. Al Capone is happy to say “Of course tax evasion is bad!”—as long as we don’t try to prosecute him for it. No, what happens is that those people acknowledge that the norms and incentives are no good, and then do not cooperate in setting up those coordination technologies, and indeed often actively sabotage other people’s attempts at such coordination.

5Benquo4h

Habryka (or my steelman of him) raises a valid point: If members of a language community predominantly experience communication as part of a conflict, then someone trying to speak an alternate, descriptive dialect can't actually do that without establishing the credibility of that intent through adequate, contextually appropriate differential signaling costs. I wish there were more attention paid to the obvious implications of this for just about any unironic social agenda relying on speech to organize itself: if you actually can't expect to be understood nondramatically when talking about X, then you actually can't nondramatically propose X, whether that's "do the most good" or "make sure the AI doesn't kill us all" or "go to Mars" or "develop a pinprick blood test" or anything else that hasn't already been thoroughly ritualized.

5kave5h

(Most of the mod team agreed about your earlier post after it was promoted to attention, so it's been Frontpaged. We have a norm against "inside baseball" on the frontpage: things that are focused to much on particular communities associated with LW. I think the mod who put it on Personal felt that it was verging on inside baseball. The majority dissented)

To get the best posts emailed to you, create an account! (2-3 posts per week, selected by the LessWrong moderation team.)

Daniel Kokotajlo's Shortform

Daniel Kokotajlo

Ω 36y

faul_sname1h20

All participating countries agree that this regime will be enforced within their spheres of influence and allow inspectors/representatives from other countries to help enforce it. All participating countries agree to punish severely anyone who is caught trying to secretly violate the agreement. For example, if a country turns out to have a hidden datacenter somewhere, the datacenter gets hit by ballistic missiles and the country gets heavy sanctions and demands to allow inspectors to pore over other suspicious locations, which if refused will lead to more

... (read more)

1MiloSal2h

First of all, I am glad you wrote this. It is a useful exercise to consider comparisons between this and other proposals, as you say. I think all of the alternatives you reference are better than this plan aside from xlr8ion and (depending on implementation) the pause. The main advantage of the other solutions is that they establish lasting institutions, mechanisms for coordination, or plans of action that convert the massive amounts of geopolitical capital burned for these actions into plausible pathways to existential security. Whereas the culling plan just places us back in 2024 or so. It's also worth noting that an AGI ban, treaty, and multilateral megaproject can each be seen as supersets of a GPU cull.

1brambleboy6h

The true slowdown in the world where this happens is probably greater, because it'd be taboo to race ahead in nations that went to such lengths to slow down.

9Cole Wyeth9h

“Cull” is not in anyone’s action space. It’s a massive coordinated global policy. The only thing we can do is advocate for it. The OP specified that we wait until there’s popular will to do something potentially radical sbout AGI for pragmatic reasons. Culling now would be nice but is not possible.

Generalized Hangriness: A Standard Rationalist Stance Toward Emotions

268

johnswentworth

People have an annoying tendency to hear the word “rationalism” and think “Spock”, despite . But I don’t know of any source directly describing a stance toward emotions which rationalists-as-a-group typically do endorse. The goal of this post is to explain such a stance. It’s roughly the concept of hangriness, but generalized to other emotions.

That means this post is trying to do two things at once:

Illustrate a certain stance toward emotions, which I definitely take and which I think many people around me also often take. (Most of the post will focus on this part.)
Claim that the stance in question is fairly canonical or standard for rationalists-as-a-group, modulo disclaimers about rationalists never agreeing on anything.

Many people will no doubt disagree that the stance I...

(Continue Reading – 1944 more words)

cousin_it2h40

This strikes a chord with me. Another maybe similar concept that I use internally is "fried". Don't know if others have it too, or if it has a different name. The idea is that when I'm drawing, or making music, or writing text, there comes a point where my mind is "fried". It's a subtle feeling but I've learned to catch it. After that point, continuing working on the same thing is counterproductive, it leads to circles and making the thing worse. So it's best to stop quickly and switch to something else. Then, if my mind didn't spend too long in the "fried" state, recovery can be quite quick and I can go back to the thing later in the day.

4Malentropic Gizmo5h

That is surprising. We often used the word in high school ~10 years ago and I'm not even a native speaker. Example

4Darklight7h

I heard this usage of "tilt" a lot when I used to play League of Legends, but almost never heard it outside of that, so my guess is that it's gamer slang.

4Said Achmiz7h

Another way to put this is “emotions as sensations vs. emotions as propositional attitudes”. (Under this framing, the thesis of the post would be “emotions are always sensations, but should not always be interpreted as propositional attitudes, because propositional attitudes should not be unstable under short-term shifts in physiological circumstances—which emotions are”.)

A night-watchman ASI as a first step toward a great future

Eric Neyman

I took a week off from my day job of aligning AI to visit Forethought and think about the question: if we can align AI, what should we do with it? This post summarizes the state of my thinking at the end of that week. (The proposal described here is my own, and is not in any way endorsed by Forethought.)

Thanks to Mia Taylor, Tom Davidson, Ashwin Acharya, and a whole bunch of other people (mostly at Forethought) for discussion and comments.

And a quick note: after writing this, I was told that Eric Drexler and David Dalrymple were thinking about a very similar idea in 2022, with essentially the same name. My thoughts here are independent of theirs.

The world around the time of ASI will be scary

I expect the...

(Continue Reading – 3143 more words)

Chris Lakin2h20

(also the broader work on for formalizing safety/autonomy, also the deontic sufficiency hypothesis)

3Fabien Roger4h

I feel like in spirit the proposal is very close to the US wanting to be world policeman while trying to commit to not infringing on other nations' sovereignty unless they pose some large risk to other nations / infringe on some basic human rights. In practice it might be very different because: * It might be way easier to get credible commitments * You might be able to get an impartial judge that is hard to sway - which allows for more fuzzy rules * You might be able to get an impartial implementation that won't exploit its power (e.g. it won't use the surveillance mechanism for other ends than the one it is made for) Is that right? I am not sure how much the commitment mechanisms buy you. I would guess the current human technology feels like it would be sufficient for very strong commitments, and the reason it doesn't happen is that people don't know what they want to commit to. What are concrete things you imagine would be natural to commit to and why can't we commit to them without an ASI (and instead just have some "if someone breaks X others nuke them")? The impartial judge also looks rough. It looks like powerful entities rarely deferred to impartial judges despite it being in principle possible to find humans without too much skin in the game. But given sufficiently good alignment, maybe you can get much better impartial judges than you historically got? I think it's not obvious it is even possible to be better than impartial human judges even with perfect and transparent alignment technology. The impartial implementation is maybe a big deal. Though again, I would guess that human tech allows for things like that and that this didn't happen for other reasons.