Vladimir_Nesov

Comments

standards of most of his other posts, where he assumes both sides are reasonable and have useful perspectives

Scott's flavor of charity is not quite this. If it were, it either wouldn't be useful for understanding sides that are unreasonable or have useless perspectives, or you'd need to routinely "assume" false things to carry out the exercise.

The point is to meaningfully engage with other perspectives, without the usual prerequisite of having positive beliefs about them. Treating them in a similar way as if they were reasonable or useful, even when they clearly aren't. Sometimes the resulting investigation changes one's mind on this point. But often it doesn't, while still revealing many details that wouldn't otherwise be noticed. Actually intervening on your own beliefs would be self-deception, while treating useless and unreasonable views as they are usually treated wouldn't be charity.

This is related to tolerance, where the point isn't to start liking people you don't like, or to start considering them part of your own ingroup. It's instead an intervention/norm that goes around the dislike to remove some of its downsides without directly removing the dislike itself.

The orthogonality thesis says that it's invalid to conclude benevolence from the premise of powerful optimization; it gestures at counterexamples. It's entirely compatible with benevolence being very likely in practice. You then might want to separately ask yourself whether it's in fact likely. But you do need to ask; that's the point of the orthogonality thesis, its narrow scope.

the data bottleneck that threatens to strangle scaling

There is no data bottleneck (for data that's not necessarily high quality), because data can be repeated in training: about 4 times without much difference compared to unique data, and up to about 16 times while still significantly improving the model. This was notably used in Galactica (see Figure 6), published Nov 2022; then there was a systematic study of scaling laws for repeated data in May 2023, and recently repeated data was applied in StarCoder 2 (Feb 2024).

A Chinchilla optimal model uses a model size proportional to dataset size, meaning compute is proportional to data squared. If you repeat data 16 times, this means finding a use for 256 times more compute. A filtered and deduplicated CommonCrawl text dataset, RedPajama-Data-v2, has 30 trillion tokens. Repeated 16 times with a Chinchilla optimal monolithic Transformer, it would use about 7e28 FLOPs of compute. This scales with data squared if there is more data to be found, which there certainly is, even if not OOMs more. Assuming BF16 training at 30% utilization, this would require 3.2e10 H100-hours, which at $2/hour costs about $65 billion. Anchoring instead to the rumored 2e25 FLOPs GPT-4 run at $100 million, this gives $350 billion. Both numbers are likely currently outside commercial feasibility, if smaller models fail to demonstrate sufficiently impressive feats. And there's still the further quadratic scaling of needed compute with more data beyond 30 trillion tokens. (Though Microscaling in Blackwell might reduce the cost of effective compute more than could otherwise be expected this soon.)
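
A minimal sketch of that arithmetic, under the usual approximations (Chinchilla-optimal ~20 tokens per parameter, training compute ≈ 6 · params · tokens) and an assumed ~2e15 FLOP/s of BF16 throughput per H100; the constants here are my assumptions chosen to reproduce the figures above, not exact specs:

```python
# Back-of-envelope for the numbers above (assumed constants, not authoritative).
tokens = 30e12 * 16             # 30T-token dataset repeated 16 times
params = tokens / 20            # Chinchilla-optimal: ~20 tokens per parameter
flops = 6 * params * tokens     # ~6.9e28 FLOPs, i.e. the ~7e28 above

# Cost anchor 1: H100-hours at an assumed ~2e15 FLOP/s and 30% utilization
flops_per_h100_hour = 2e15 * 0.30 * 3600
h100_hours = flops / flops_per_h100_hour     # ~3.2e10 H100-hours
cost_gpu = h100_hours * 2                    # ~$65e9 at $2/hour

# Cost anchor 2: scale up from a rumored 2e25-FLOP GPT-4 run at $100 million
cost_gpt4_anchor = flops / 2e25 * 100e6      # ~$350e9

print(f"{flops:.1e} FLOPs, {h100_hours:.1e} H100-hours, "
      f"~${cost_gpu/1e9:.0f}B (GPU-hours), ~${cost_gpt4_anchor/1e9:.0f}B (GPT-4 anchor)")
```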

Individually, logical counterfactuals don't seem very coherent. This is related to the "I'm an algorithm" vs. "I'm a physical object" distinction of FDT. When you are an algorithm considering a decision, you want to mark all sites of intervention/influence in the world where the world depends on your behavior. If you only mark some of them, then you later fail at the step where you ask what happens if you act differently: you obtain a broken counterfactual world where only some instances of the fact of your behavior have been replaced and not others.

So I think it makes a bit more sense to ask where specifically your brain depends on a fact, to construct an exhaustive dependence of your brain on the fact, before turning to the particular counterfactual content that the fact should be replaced with. That is, the dependence of a system on a fact, the way it varies with the fact, seems potentially clearer than individual counterfactuals of how that system works if the fact is set to be a certain way. (To make a somewhat hopeless analogy: a fibration instead of individual fibers, and it shouldn't be a problem that all fibers are different from each other. Any question about a counterfactual should be reformulated into a question about a dependence.)
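
To make "only some sites get replaced" concrete, here's a toy sketch of my own (a Newcomb-style setup, not something from the comments above): the world depends on the same decision fact at two sites, and a counterfactual that replaces the fact at only one of them gives the familiar broken answer, while the dependence itself (the world as a function of the decision) handles all sites consistently.

```python
# Toy Newcomb-like setup (hypothetical illustration): the decision fact appears
# at two sites, in the predictor's model of the agent and in the agent's action.

def world(decision: str) -> int:
    """The dependence: how the world varies with the decision fact (all sites)."""
    predictor_fills_box = (decision == "one-box")   # site 1: predictor's copy of you
    agent_takes_both = (decision == "two-box")      # site 2: your physical action
    return (1_000_000 if predictor_fills_box else 0) + (1_000 if agent_takes_both else 0)

def broken_counterfactual(actual: str, counterfactual: str) -> int:
    """Replace the fact at only one site: the action changes, the predictor's copy doesn't."""
    predictor_fills_box = (actual == "one-box")           # site 1 left unreplaced
    agent_takes_both = (counterfactual == "two-box")      # site 2 replaced
    return (1_000_000 if predictor_fills_box else 0) + (1_000 if agent_takes_both else 0)

print(world("one-box"), world("two-box"))           # 1000000 1000
print(broken_counterfactual("one-box", "two-box"))  # 1001000: the "free" $1000 of a broken world
```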

I don't think there is a significant source of this meme among confused, naive supporters that gives it teeth. It's more that reasonable people who are not any sort of supporters of AI safety propagate the idea, on the grounds that it illustrates how AI safety is not just dumb, but also dangerous, and therefore worth warning others about.

From the supporter side, "Open Model Weights are Unsafe and Nothing Can Fix This" is a shorter and more convenient way of gesturing at the concern, and convenience is the main force in the Universe that determines all that actually happens in practice. On a naive reading, such gesturing centrally supports the meme. This doesn't require the source of such support to have a misconception, or to oppose publishing open weights of current models on the grounds of direct consequences.

I regularly encounter the impression that AI safety people are significantly afraid of the direct consequences of open sourcing current models, from those who don't understand the actual concerns. I don't particularly see it from those who do. This (from what I can tell, false) impression seems to be one of relatively few major memes that keep people from bothering to investigate. I hypothesize that it's this dynamic of ridiculing AI safety with such memes that keeps them alive, rather than any significant truth to them.

I don't get the impression that very many are afraid of the direct effects of open sourcing current models. The impression that many in AI safety are afraid of specifically that is a major focus of ridicule from people who didn't bother to investigate, and a reason not to bother investigating. Possibly this alone fuels the meme sufficiently to keep it alive.

There's volition, and all it entails. It can be instrumentally relevant that something is not "objective" or even "real", just as it's prudent to be willing to pay for groceries. If an earlier impression promised more clarity than remained after further study, that's a practical concern, possible to work on.

it requires racks of power hungry GPUs to achieve the above that are mounted in data centers

Inference with models trained for ternary quantization (which uses massively fewer multiplications and so less power) only needs hardware that can take advantage of it, and doesn't significantly lose quality compared to full precision. Though I don't know if there is a good RNN-like block that enables large context while still mostly avoiding multiplications with ternary weights (as opposed to activations, which need to be more precise), which seems crucial for video. A more pressing issue might be latency.
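
As a rough sketch of why ternary weights dodge most multiplications (my own illustration of a BitNet-b1.58-style layer, not a specific implementation): with weights in {-1, 0, +1}, a matrix-vector product reduces to additions and subtractions of the higher-precision activations.

```python
import numpy as np

def ternary_matvec(w_ternary: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Matrix-vector product with weights in {-1, 0, 1}: no multiplications needed."""
    out = np.zeros(w_ternary.shape[0], dtype=x.dtype)
    for i in range(w_ternary.shape[0]):
        # add activations where the weight is +1, subtract where it is -1, skip zeros
        out[i] = x[w_ternary[i] == 1].sum() - x[w_ternary[i] == -1].sum()
    return out

rng = np.random.default_rng(0)
w = rng.integers(-1, 2, size=(4, 8))            # ternary weights
x = rng.standard_normal(8).astype(np.float32)   # activations stay higher precision
assert np.allclose(ternary_matvec(w, x), w @ x, atol=1e-5)
```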

Apparently Blackwell supports Microscaling, a block number format where multiple numbers share a scaling factor, and 4-6 bit Microscaling can be used for training (not just inference) as a drop-in replacement for FP32 (see page 7). For inference, models created with quantization-aware training (as opposed to being quantized post-training) are approximately as good as models in high precision (for the same training data and number of parameters).

So appeals to FP4/FP6 performance are not empty marketing; the format seems to have an actual, moderately straightforward use.
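
For intuition, here's a simplified sketch of the shared-scale idea (my own illustration: real MX formats use specific element types such as FP4 E2M1 with block size 32, whereas this uses a plain signed integer code per element):

```python
import numpy as np

def mx_quantize_block(block: np.ndarray, elem_bits: int = 4) -> tuple[int, np.ndarray]:
    """One shared power-of-two scale per block, plus a narrow code per element."""
    max_code = 2 ** (elem_bits - 1) - 1                  # e.g. 7 for a signed 4-bit code
    # pick the scale so the largest element in the block still fits
    shared_exp = int(np.ceil(np.log2(np.abs(block).max() / max_code + 1e-30)))
    scale = 2.0 ** shared_exp
    codes = np.clip(np.round(block / scale), -max_code, max_code).astype(np.int8)
    return shared_exp, codes

def mx_dequantize_block(shared_exp: int, codes: np.ndarray) -> np.ndarray:
    return codes.astype(np.float32) * np.float32(2.0 ** shared_exp)

block = np.random.default_rng(0).standard_normal(32).astype(np.float32)
exp_, codes = mx_quantize_block(block)
print(np.abs(block - mx_dequantize_block(exp_, codes)).max())  # small quantization error
```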
