Vladimir_Nesov

Trainium is mostly a joke

I think it can help AWS with price-performance for the narrow goal of giant pretraining runs, where capex on training systems might soon be the primary constraint on scaling. For reasoning training (if it does scale), building a single training system is less relevant; the usual geographically distributed inference buildout that hyperscalers are doing anyway would be about as suitable. And the 400K chip Rainier system indicates that it works well enough to ramp (a datapoint in addition to the on-paper specifications).

Chinese firms ... will ultimately lack data

I don't think there is a meaningful distinction for data: all natural text data runs out around 2027-2029 anyway, due to the data inefficiency of MoE. No secret stashes at Google or Meta are going to substantially help, since even 10T-100T tokens won't change the game.

There's this paper suggesting that RLVR (which is what Absolute Zero generates training data for) can't reach capabilities exceeding those of the base pretrained model sampled at something like pass@400 (depending on the task).
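For context, pass@k here is the standard estimator from the code generation literature (Chen et al. 2021); a minimal sketch with made-up numbers, not figures from the paper:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased pass@k estimator: chance that at least one of k samples,
    # drawn without replacement from n generations (c of them correct),
    # solves the task.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical numbers: a base model correct on only 2% of samples
# still nearly always solves the task at pass@400, which is the
# regime where RLVR models reportedly stop having an edge.
print(pass_at_k(n=1000, c=20, k=400))  # ~0.99997
```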

unlike every other company in the space they aren't dependent on Nvidia's chips like everyone else is

AWS is credibly becoming independent with their Trainium 2 Ultra and Project Rainier (250K H100s worth of compute in a single system). The world size isn't on the level of GB200 NVL72 though, so it's not nearly as good for reasoning models, but it should be fine for pretraining giant models.

There's also Huawei's CloudMatrix 384, which does rival GB200 NVL72 in world size, but is built on a 7nm process with optical scale-up networking. The currently known dies were manufactured by TSMC via intermediaries, though Huawei secured enough of them to in principle match the Crusoe/Stargate/OpenAI Abilene datacenter campus as it will stand in 2026, and in theory domestic manufacturing might catch up.

The thing about 7nm chips is that they are only 1 major process step (each taking about 2 years) behind the current 4nm Blackwell, and only 2 major steps behind the future 3nm Rubin. That puts them at merely a ~4x price-performance disadvantage (compared to Rubin, in the critical window of 2027-2028), while the large world size of CloudMatrix 384 should in principle allow anything that Nvidia hardware allows (in practice it might be much less reliable, and programming it might be more difficult). And that's before counting the Nvidia tax, which might cut the disparity in half, turning hypothetical zero-margin Rubin systems of 2027-2028 that are ~4x cheaper (per FLOP/s) into systems merely ~2x cheaper (than CloudMatrix 384).
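A back-of-envelope version of that arithmetic (all numbers are the assumptions stated above, not measurements):

```python
# Assumed: ~2x price-performance per major process step (7nm -> 4nm -> 3nm),
# and a rough ~2x Nvidia margin over a hypothetical zero-margin system.
steps_behind_rubin = 2
gain_per_step = 2.0
raw_disadvantage = gain_per_step ** steps_behind_rubin      # ~4x vs zero-margin Rubin

nvidia_margin = 2.0
effective_disadvantage = raw_disadvantage / nvidia_margin   # ~2x vs Rubin at Nvidia prices

print(f"~{raw_disadvantage:.0f}x vs hypothetical zero-margin Rubin")
print(f"~{effective_disadvantage:.0f}x vs Rubin with the Nvidia tax")
```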

A tree doesn't simulate a meaningful algorithm, so in the analogy, chopping it down would be approximately just as good.

When talking about running algorithms, I'm not making claims about identity or preserving-the-original in some other sense, as I don't see how these things are morally important, necessarily (I can't rule out that they might be, on reflection, but currently I don't see it). What I'm saying is that a biological brain doesn't have an advantage at the task of running the algorithms of a human mind well, for any sensible notion of running them well. We currently entrust this task to the biological brain, because there is no other choice, and because it's always been like this. But I don't see a moral argument there.

it is Llama 4 that obviously copied DeepSeek

DeepSeek-V3's MoE architecture is unusual in having high granularity: 8 active routed experts rather than the usual 1-2. Llama 4 Maverick doesn't do that[1]. The closest thing is the recent Qwen3-235B-A22B, which also has 8 active experts. (A toy sketch of the routing difference follows the footnote.)


  1. From the release blog post:

    As an example, Llama 4 Maverick models have 17B active parameters and 400B total parameters. ... MoE layers use 128 routed experts and a shared expert. Each token is sent to the shared expert and also to one of the 128 routed experts.

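To make the granularity difference concrete, here is a toy top-k router (illustrative only: real MoE layers add shared experts, load-balancing losses, and capacity limits):

```python
import torch

def route(hidden, router_weight, k):
    # Toy top-k MoE router: pick k routed experts per token and
    # renormalize their gate weights. k=1 is the coarse Maverick-style
    # routing quoted above (shared expert omitted); k=8 is the
    # high-granularity DeepSeek-V3 / Qwen3-235B-A22B style.
    logits = hidden @ router_weight                  # [tokens, n_experts]
    gates, idx = torch.topk(logits.softmax(-1), k, dim=-1)
    gates = gates / gates.sum(-1, keepdim=True)
    return idx, gates

hidden = torch.randn(4, 64)           # 4 tokens, toy hidden width
router = torch.randn(64, 128)         # 128 routed experts, as in Maverick
idx1, _ = route(hidden, router, k=1)  # one routed expert per token
idx8, _ = route(hidden, router, k=8)  # eight routed experts per token
```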

The status quo is one difference, but I don't see any other prior principles that point to the future biological brain being a (morally) better way of running a human mind forward than other kinds of implementations of the mind's algorithm. Applying a variant of the reversal test: a civilization of functionally human uploads should then have a reason to become biological, but I don't think there is a currently known clear reason to prefer that change.

Your future biological brain is also doing some sort of impression of a continuation of the present you. It's not going to be doing an optimal job of it, for any nontrivial notion of what that should mean.

After you bundle the values of free variables into the proposition (make a closure), and "I" and such get assigned their specific referents, "It's Friday" is still in trouble: if it gets the current time bundled in, then it's true or false depending on the time, not on the world (in the maximally detailed world history sense), and so it's true about all worlds or none ("the worlds where it's Friday on Tuesday"); there is no nontrivial set-of-worlds meaning. But with the mail proposition (or other propositions about policies) there is no problem like that.
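A toy rendering of the closure point (my framing, nothing standard): once the utterance time is bundled in, the resulting predicate on worlds is constant, while a policy-shaped proposition keeps varying with the world.

```python
from datetime import datetime

def its_friday_at(t: datetime):
    # Bundling the time in (making the closure) fixes the truth value:
    # the returned predicate gives the same answer for every world.
    truth = (t.weekday() == 4)
    return lambda world: truth

def checks_mail_on_fridays(world: dict) -> bool:
    # World-dependent: truth varies with the world's actual history,
    # not with who or when is asking.
    return all(world["checked_mail"][day] for day in world["fridays"])

world = {"fridays": ["day3"], "checked_mail": {"day3": True}}
print(its_friday_at(datetime(2025, 1, 3))("any world"))  # True regardless of world
print(checks_mail_on_fridays(world))                     # depends on the world
```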

An indexical proposition doesn't evaluate a possible world (in the sense of a maximally detailed world history); it evaluates a possible world together with a location or time or a subsystem (such as an agent/person) in it. But pointing to some location or subsystem that is already part of the world doesn't affect the possible world itself, so it makes little sense for preference to differ depending on which location or subsystem we are looking at (from outside the world) while the world remains completely the same. The events that should be objects of preference are made of worlds, not of (world, embedded subsystem) pairs.

Whether "It's Friday" is not a property of a world, it's a property of a world together with some admissible spacetime location. You can't collect a set of possible world histories that centrally have the property of "It's Friday". On the other hand, "I try to make sure to check the mail on Fridays" is a property that does distinguish worlds where it's centrally true (for a specific value of "I"). In general, many observations are indexical, they tell you where you are, which version of you is currently affecting the world, and a policy converts those indexical observations into an actual effect in the world that can be understood as an event, in the sense of a set of possible worlds.

Many propositions can be given event semantics, but with the caveat that events should still be parts of a space of possible worlds, so that "It's Friday" is not an event (while "When it's Friday, I try to check the mail" is).
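A minimal sketch of that conversion (my toy construction, not standard machinery): a policy maps indexical observations to actions, and the worlds whose histories match the policy's prescriptions form an ordinary event.

```python
def policy(obs: str) -> str:
    # Indexical input: "it's Friday here/now", with no information
    # about which world this situation is embedded in.
    return "check_mail" if obs == "friday" else "do_nothing"

def satisfied_in(world) -> bool:
    # World-level property: every situation in the world's history
    # records the action the policy prescribes for its observation.
    return all(act == policy(obs) for obs, act in world)

worlds = {
    "w1": [("friday", "check_mail"), ("monday", "do_nothing")],
    "w2": [("friday", "do_nothing")],
}
event = {name for name, w in worlds.items() if satisfied_in(w)}
print(event)  # {'w1'}: a set of worlds, a legitimate object of preference
```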
