Vladimir_Nesov
Posts (sorted by new)

Musings on Reported Cost of Compute (Oct 2025) · 103 points · 25d · 11 comments
Permanent Disempowerment is the Baseline · 80 points · 3mo · 23 comments
Low P(x-risk) as the Bailey for Low P(doom) · 50 points · 4mo · 29 comments
Musings on AI Companies of 2025-2026 (Jun 2025) · 66 points · 5mo · 4 comments
Levels of Doom: Eutopia, Disempowerment, Extinction · 34 points · 5mo · 1 comment
Slowdown After 2028: Compute, RLVR Uncertainty, MoE Data Wall · 196 points · 7mo · 25 comments
Short Timelines Don't Devalue Long Horizon Research · 176 points · 7mo · 24 comments
Technical Claims · 19 points · 8mo · 0 comments
What o3 Becomes by 2028 · 151 points · 11mo · 15 comments
Musings on Text Data Wall (Oct 2024) · 41 points · 1y · 2 comments
Vladimir_Nesov's Shortform · 10 points · 1y · 146 comments

Comments (sorted by newest)

KAP's Shortform
Vladimir_Nesov · 3h

Unipolarity is about the characteristic time to takeover vs. the time to emergence of worthy rivals. Currently multiple AI companies are robustly within months of each other in capabilities. So an AI can only be in a unipolar situation if it can disarm the other AI companies before they get similarly capable AIs, that is, within months. Superpersuasion might be too slow for that on its own (unless it also manages to manipulate the relevant governments), though it could be a step in a larger plan that escalates to something else.

I think superpersuasion (even in milder senses) would in principle be sufficient for takeover on its own if there was enough time, because it could direct the world towards a gradual disempowerment path. Since there isn't enough time, there needs to be a second step that enables a faster takeover to preserve unipolarity, and superpersuasion would still be helpful in getting its creator AI company to play along with the second step. But the issue with many possibilities for this second step is that the AI doesn't necessarily have the option of recursive self-improvement to advance its own capabilities, because the AI might be unable to quickly develop smarter AIs that are aligned with it.

KAP's Shortform
Vladimir_Nesov · 5h

AI is not one agent (at least before the dust settles): both human developers and self-improvement create new agents that could be misaligned with existing AIs. The issue of misaligned AIs is urgent for existing AIs, and soft takeovers of gradual disempowerment (where superpersuasion might play a role) are likely too slow. But recursive self-improvement isn't necessarily useful for AIs in resolving this problem quickly, if alignment is hard. This motivates a quick takeover without superintelligence.

KAP's Shortform
Vladimir_Nesov · 1d

> aligned with, say, the bay area intellectual's worldview, then it may seem like a tyrant to other people

Unless "bay area intellectual's worldview" itself respects human self-determination. Even if respect for autonomy could be sufficient almost on its own in some ways, it might also turn out to be a major aspect of most other reasonable alignment targets.

Diagonalization: A (slightly) more rigorous model of paranoia
Vladimir_Nesov · 2d

> I am more talking about the broader phenomenon of "simulating other agents adversarially in order to circumvent their predictions"

The idea of "simulating adversarially" might be a bit confusing in the context of diagonalization, since it's the diagonalization that is adversarial, not the simulation. In particular, you'd want mutual simulation (or rather more abstract reasoning) for coordination. If you merely succeed in acting contrary to a prediction, making the prediction wrong, that's not diagonalization. What diagonalization does is make the prediction not-happen in the first place (or in the case of putting a credence on something, for the credence to remain at some weaker prior). So diagonalization is something done against a predictor whose prediction is targeted, rather than something done by the predictor. A diagonalizer might itself want to be a predictor, but that is not necessary if the prediction is just given to it.

julius vidal's Shortform
Vladimir_Nesov · 2d

> Instead of trying to present any kind of utopian vision of the benefits of AI, someone at Anthropic decided to sell us the image of an internet dominated by endless cyberwar trapped in a perverse feedback loop in escalating speed and incomprehensibility.

Good. If this is what the authors believe the future holds, it's much better that they say it than search for a rosy-sounding justification.

Diagonalization: A (slightly) more rigorous model of paranoia
Vladimir_Nesov · 2d

> We can proof by contradiction that if one agent is capable of predicting another agent, the other agent cannot in turn do the same.

Only if one of them is diagonalizing the other (acting contrary to what the other would've predicted about its actions). If this isn't happening, maybe there is no problem.

For example, the halting problem is unsolvable because you are asking for a single predictor that predicts the behavior of every program, and among all programs there is at least one (easy to construct) that diagonalizes the predictor's prediction of its behavior: it predicts the predictor and does the opposite, acting contrary to whatever the predictor would've predicted. But proving that a specific program halts (or doesn't) is often possible; that's not the halting problem.
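To make the construction concrete, here is a minimal sketch in Python (not from the original comment): `halts` is a hypothetical oracle standing in for the claimed predictor, and the diagonalizer consults it about its own behavior and then does the opposite.

```python
def halts(program_source: str, input_data: str) -> bool:
    """Hypothetical halting oracle (assumed, not implementable in general):
    would return True iff running program_source on input_data eventually halts."""
    raise NotImplementedError("no total halting predictor exists")


def diagonalizer(own_source: str) -> None:
    """Acts contrary to whatever the oracle predicts about this very program."""
    if halts(own_source, own_source):
        while True:      # predicted to halt, so loop forever instead
            pass
    else:
        return           # predicted to loop, so halt immediately
```

Feeding `diagonalizer` its own source contradicts whatever `halts` says about it, so no total `halts` can be correct on every program, even though many specific programs can still be proven to halt or not.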

> If the smaller agent was also perfectly predicting the bigger agent, then the bigger agent couldn't be perfectly predicting the smaller agent, as doing so would trigger an infinite regress

There is no infinite regress, and probably no useful ordering of agents/programs by how big they are in this way. It's perfectly possible for agents to reason about each other, including about their predictions about themselves or each other. And where there is diagonalization, it doesn't exactly say which agent was bigger (an agent can even diagonalize itself, to make its own actions unpredictable to itself).

See for example the ASP problem, a variant of Newcomb's problem where the predictor is "smaller" and predictable by stipulation (rather than an all-powerful Omega), and so the "bigger" box-choosing agent needs to avoid any sudden movements in its thoughts in order to remain predictable and get the big box filled by the predictor.

Maybe quines can illustrate how there is no by-default infinite regress. You can write a program in Python that prints a program in Java that in turn prints the original program in Python. Neither of the programs is "bigger" than the other.
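A minimal single-language sketch of the same point (Python for both programs, rather than the Python/Java pair described above, and not taken from the comment): ignoring comments, each of these two three-line programs prints the other's code exactly, and neither contains the other the way a bigger thing contains a smaller one.

```python
# Program A: running it prints the three code lines of program B.
x = 'A'
t = 'x = %r\nt = %r\nprint(t %% (("B" if x == "A" else "A"), t))'
print(t % (("B" if x == "A" else "A"), t))

# Program B is A's output, identical except for its first line (x = 'B');
# running B prints the code of A again, closing the two-step cycle.
```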

> When a larger agent contains a smaller agent this way, the smaller agent can simply be treated like any other part of the environment. If you want to achieve a goal, you simply figure what action of yours produces the best outcome, including the reaction from the smaller agent.

Other than blinding itself to the bigger agent's actions, there might be alternative, safer ways of observing the bigger agent: reasoning about it rather than directly observing what it actually does. Even a "big" agent doesn't contain or control all the reasoning about it; a theory of an agent is bigger than the agent itself, and others can pick and choose what to reason about. Also, self-contained reasoning that produces some conclusion can itself make use of observations of the "big" agent, as long as the observations are not used for anything else. So it's not even necessarily about blinding, but rather about compartmentalized reasoning, where the observations (tainted data) don't get indiscriminate influence but can still be carefully used to learn things.

> Ok, but please, does anyone have a suggestion for a better term than "diagonalization"?

It's from Cantor's diagonal argument. See also the diagonal argument more generally, and Lawvere's fixed point theorem. It's just this: you construct an endomap without fixed points, and that breaks stuff. This works as well for maps that are defined/enacted by agents in their behavior, mapping beliefs/observations to actions; you just need to close the loop so that beliefs/observations start talking about the same things as the actions.
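For reference, a compressed version of that construction (standard textbook material, not from the comment), with Lawvere's theorem stated as the positive form of the same diagonal trick:

```latex
Given $f : X \to \mathcal{P}(X)$, define the diagonal set
\[
  D = \{\, x \in X : x \notin f(x) \,\}.
\]
If $D = f(x_0)$ for some $x_0$, then $x_0 \in D \iff x_0 \notin f(x_0) = D$,
a contradiction, so $f$ is not surjective. Lawvere's fixed point theorem is the
positive form: if $e : A \to B^A$ is point-surjective, then every $g : B \to B$
has a fixed point, since picking $a$ with $e(a) = \big(x \mapsto g(e(x)(x))\big)$
gives $e(a)(a) = g(e(a)(a))$. An endomap $g$ with no fixed point (negation on
truth values, in Cantor's case) therefore rules out any such surjection.
```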

Don't use the phrase "human values"
Vladimir_Nesov · 3d

Stating things explicitly is a tradeoff that must be decided by success or failure in conveying the intended point, not by stricture of form.

By "human values" being distinct from arbitrary values I simply mean that anything called "human values" is less likely to be literal paperclipping than values-in-general, it's suggesting a distribution over values that's human-specific in some way. By "preferences" also gesturing at their further development on reflection I'm pointing out that this is a strong possibility for what the term might mean, so unless a clarification rules it out, it remains a possible intended meaning. (More specifically, I meant the whole process of potential ways of developing values/preferences, not some good-enough end-point, so not just thinking for many hours, but also not disregarding current wishes/wants/beliefs, as they too are part of this process.)

Don't use the phrase "human values"
Vladimir_Nesov · 3d

I think "someone's preferences", or "moral goodness" are approximately the same as "human values" in meaning and ambiguity unless clarified, and the clarifications would work similarly well or poorly for either of them. What "human values" gesture at is distinction from values-in-general, while "preferences" might be about arbitrary values. Taking current wishes/wants/beliefs as the meaning of "preferences" or "values" (denying further development of values/preferences as part of the concept) is similarly misleading as taking "moral goodness" as meaning anything in particular that's currently legible, because the things that are currently legible are not where potential development of values/preferences would end up in the limit.

"But You'd Like To Feel Companionate Love, Right? ... Right?"
Vladimir_Nesov · 3d

"human values" being maximized by a Singleton forever would importantly fall short of my ideal future

I would expect that letting (other) people define themselves is part of "human values", and so maximizing the influence of such values on the world would let decisions of individual existing people screen off the Singleton's decisions, at least when it comes to their own development. Any decision of a Singleton about how a person's thinking and values should be developing is not legitimate to that person's values if it doesn't ultimately follow that person's own decisions in some way. Values don't define preference over just the end states of a world; they define how the initial conditions that are already in place should develop, and existing people are part of the initial conditions.

This works even if a Singleton literally writes down all of the future, including people and their thoughts, in the same way as this goes with physics writing down the future. Decisions of people embedded in a Singleton can still remain their own, the same as with people embedded in physics; it's just another setting for making decisions within a lawful environment/substrate.

"But You'd Like To Feel Companionate Love, Right? ... Right?"
Vladimir_Nesov · 3d

One step further: why treat any feelings/emotions you do have as your own values? Maybe they gesture at something you endorse, maybe they don't, but they certainly shouldn't suffice by themselves. Even though it's something happening in your own brain, it's still an external influence until you accept it as a part of you, and even then you might change your mind at some point.

Wikitag Contributions

Well-being · 2 months ago · (+58/-116)
Sycophancy · 2 months ago · (-231)
Quantilization · 2 years ago · (+13/-12)
Bayesianism · 3 years ago · (+1/-2)
Bayesianism · 3 years ago · (+7/-9)
Embedded Agency · 3 years ago · (-630)
Conservation of Expected Evidence · 4 years ago · (+21/-31)
Conservation of Expected Evidence · 4 years ago · (+47/-47)
Ivermectin (drug) · 4 years ago · (+5/-4)
Correspondence Bias · 4 years ago · (+35/-36)