I think on the object level, one of the ways I'd see this line of argument falling flat is this part:
Some AI safety problems are legible (obvious or understandable) to company leaders and government policymakers, implying they are unlikely to deploy or allow deployment of an AI while those problems remain open (i.e., appear unsolved according to the information they have access to).
I am not at all comfortable relying on nobody deploying just because there are obvious, legible problems. With the right incentives and selection pressures, I think people can be amazing at not noticing or understanding obvious, understandable problems. Actual illegibility does not seem required.
In my experience, the main issue with this kind of thing is finding really central examples of symmetries in the input that are emulatable. There are a couple of easy ones, like low-rank[1] structure, but I never really managed to get a good argument for why generic symmetries in the data would often be emulatable[2] in real life.[3]
You might want to chat with Owen Lewis about this. He's been thinking about connections between input symmetries and mechanistic structure for a while, and was interested in figuring out some kind of general correspondence between input symmetries and parameter symmetries.
Good name for this concept by the way, thanks.
For a while I was hoping that almost any kind of input symmetry would tend to correspond to low-rank structure in the hidden representations of the network, if the network has the sort of architecture used by modern neural networks. Then, almost any kind of symmetry would be reducible to the low-rank structure case[2], and hence almost any symmetry would be emulatable.
But I never managed to show this, and I no longer think it is true.
There are a couple of necessary conditions for this of course. E.g. the architecture needs to actually use weight matrices, like neural networks do.
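To make the easy case concrete, here's a minimal numpy sketch (my own toy construction, reading "low-rank structure" as "the inputs all lie in a low-dimensional subspace", which is an assumption on my part): if the data only ever occupies an r-dimensional subspace, then any perturbation of the first-layer weight matrix along the unused input directions leaves the layer's outputs on that data unchanged, so the structure in the inputs shows up as free directions in parameter space.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h, r = 200, 10, 16, 3   # samples, input dim, hidden dim, input rank

# Inputs confined to an r-dimensional subspace of R^d (low-rank input structure).
basis = rng.standard_normal((r, d))
X = rng.standard_normal((n, r)) @ basis          # shape (n, d), rank <= r

W1 = rng.standard_normal((d, h))                 # first-layer weights

# Directions in input space the data never uses: the null space of X.
_, _, Vt = np.linalg.svd(X)
null_dirs = Vt[r:]                               # (d - r) orthonormal rows

# Perturb W1 only along those unused directions.
delta = null_dirs.T @ rng.standard_normal((d - r, h))
W1_perturbed = W1 + delta

# The layer's outputs on the training inputs are unchanged even though the
# weights moved: (d - r) * h free parameter directions induced by the input structure.
print(np.max(np.abs(X @ W1 - X @ W1_perturbed)))  # ~0 up to numerical noise
```

The part I never managed to show is that more generic symmetries in the data reduce to something this clean.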
The WaPo article appears to refer to passenger fatalities per billion passenger miles, not total fatalities. For comparison, trains in the European Union in 2021 apparently had ca. 0.03 passenger fatalities per billion passenger miles, but almost 0.3 total fatalities per million train miles.
Right now it reads like one example of the pledged funding being met, one example of it being only ca. 3/4 met (but with two years still left until the original deadline), and one example of the funding never getting pledged in the first place (since Congress didn't pass it).
I agree this is a pitifully small investment. But it doesn't seem like big bills and programs got created and then walked back. More like they just never came to be in the first place. 4.5 billion euros is a paltry sum.
I think this may be an important distinction to make, because it suggests there was perhaps never much political push to prepare for the next pandemic even at the time. Did people actually 'memory hole' and forget, or did they just never care in the first place?
I for one don't recall much discussion about preparing for the next pandemic outside rationalist/EA-adjacent circles even while the Covid-19 pandemic was still in full swing.
The Pandemic Fund got pledged $3 bio.
...
the Pandemic Fund has received $3.1 bio, with an unmet funding gap of $1 bio. as of the time of writing.
I'm confused. This makes it sound like they did get the pledged funding?
For what it's worth, my mother read If Anyone Builds It, Everyone Dies and seems to have been convinced by it. She's probably not very representative though. She had prior exposure to AI x-risk arguments through me, is autistic, has a math PhD, and is a Gödel, Escher, Bach fan.
The proposal at the end looks somewhat promising to me on a first skim. Are there known counterpoints to it?
I agree that this seems maybe useful for some things, but not for the "Which UTM?" question in the context of debates about Solomonoff induction specifically, and I think that's the "Which UTM?" question we are actually kind of philosophically confused about. I don't think we are philosophically confused about which UTM to use when we already know some physics and want to incorporate that knowledge into the UTM pick; we're confused about how to pick when we don't have any information at all yet.
Attempted abstraction and generalization: If we don't know what the ideal UTM is, we can start with some arbitrary UTM and use it to predict the world for a while. After (we think) we've gotten most of our prediction mistakes out of the way, we can then look at our current posterior, and ask which other UTM might have updated to that posterior faster, using fewer bits of observation about (our universe/the string we're predicting). You could think of this as a way to define what the 'correct' UTM is. But I don't find that definition very satisfying, because the validity of this procedure for finding a good UTM depends on how correct the posterior we've converged on with our previous, arbitrary UTM is. 'The best UTM is the one that figures out the right answer the fastest' is true, but not very useful.
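As a toy illustration of the circularity (swapping UTMs for priors over a small finite hypothesis class, purely as a stand-in of my own; the bias grid, candidate priors, and convergence tolerance below are all arbitrary choices): start with an arbitrary prior, let it converge on some data, then ask which other prior would have reached that converged posterior in the fewest observations.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for "programs on a UTM": a small family of hypotheses about a coin's bias.
biases = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
data = (rng.random(2000) < 0.7).astype(int)      # the "world" we're predicting

def posterior(prior, obs):
    """Bayesian posterior over the biases after seeing the observations."""
    heads, tails = obs.sum(), len(obs) - obs.sum()
    log_post = np.log(prior) + heads * np.log(biases) + tails * np.log(1 - biases)
    log_post -= log_post.max()
    post = np.exp(log_post)
    return post / post.sum()

# Step 1: converge using an arbitrary starting prior (the arbitrary UTM).
uniform = np.full(len(biases), 1 / len(biases))
converged = posterior(uniform, data)             # what we believe after a long run

# Step 2: ask which other prior would have reached (close to) this posterior fastest.
candidate_priors = {
    "uniform":     uniform,
    "favours 0.5": np.array([0.05, 0.1, 0.7, 0.1, 0.05]),
    "favours 0.7": np.array([0.05, 0.05, 0.1, 0.7, 0.1]),
}

def obs_needed(prior, target, tol=0.05):
    """Number of observations until the posterior is within tol (L1) of target."""
    for n in range(1, len(data) + 1):
        if np.abs(posterior(prior, data[:n]) - target).sum() < tol:
            return n
    return None

for name, prior in candidate_priors.items():
    print(name, obs_needed(prior, converged))
```

The prior that already concentrates on whatever hypothesis the arbitrary starting prior converged to should generally come out ahead, which is just the 'figures out the right answer fastest' non-definition restated.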
Is the thermodynamics angle gaining us any more than that for defining the 'correct' choice of UTM?
We used some general reasoning procedures to figure out some laws of physics and stuff about our universe. Now we're basically asking what other general reasoning procedures might figure out stuff about our universe as fast or faster, conditional on our current understanding of our universe being correct.
Problem with this: I think training tasks in real life are usually not, in fact, compatible with very many parameter settings. Unless the training task is very easy compared to the size of the model, basically all spare capacity in the model parameters will be used up eventually, because there's never enough of it. The net can always use more, to make the loss go down a tiny bit further, predict internet text and sensory data just a tiny bit better, score a tiny bit higher on the RL reward function. If nothing else, spare capacity can always be used to memorise some more training data points. H(Θ|API) may be maximal given the constraints, but the constraints will get tighter and tighter as training goes on and the amount of coherent structure in the net grows, until approximately every free bit is used up.[1]
But we can still ask whether there are subsets of the training data on which the model outputs can be realised by many different parameter settings, and try to identify internal structure in the net that way, looking for parts of the parameters that are often free. If a circuit stores the fact that the Eiffel Tower is in Paris, the parameter settings in that circuit will be free to vary on most inputs the net might receive, because most inputs don't actually require the net to know that the Eiffel Tower is in Paris to compute its output.
A mind may have many skills and know many facts, but only a small subset of these skills and facts will be necessary for the mind to operate at any particular moment in its computation. This induces patterns in which parts of the mind's physical implementation are or aren't free to vary in any given chunk of computational time, which we can then use to find the mind's skills and stored facts inside its physical instantiation.
So, instead of doing stat mech to the loss landscape averaged over the training data, we can do stat mech to the loss landscapes, plural, at every training datapoint.
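As a crude first-order sketch of what 'parts of the parameters that are often free' could mean (my own toy example, not anything from the post): in a small ReLU network, the first-layer weights feeding into a unit that is switched off for a given input have exactly zero gradient on that input, so wiggling them doesn't change the output there, and you can tabulate a per-datapoint free/used mask for every weight.

```python
import numpy as np

rng = np.random.default_rng(2)
d, h, n = 6, 4, 8                                # input dim, hidden units, samples

W1 = rng.standard_normal((h, d))
w2 = rng.standard_normal(h)
X = rng.standard_normal((n, d))

def per_sample_grads(x):
    """Analytic gradients of the scalar output f(x) = w2 . relu(W1 x)
    with respect to W1 and w2, for a single input x."""
    pre = W1 @ x
    act = np.maximum(pre, 0.0)
    gate = (pre > 0).astype(float)               # which hidden units are 'on' for this x
    grad_W1 = np.outer(w2 * gate, x)             # zero rows for units that are off
    grad_w2 = act
    return grad_W1, grad_w2

# For each sample, mark which first-layer weights are (locally) free:
# their gradient on that sample is exactly zero, so wiggling them
# doesn't change the output on that input.
free_counts = np.zeros_like(W1)
for x in X:
    gW1, _ = per_sample_grads(x)
    free_counts += (gW1 == 0.0)

print(free_counts / n)   # fraction of inputs on which each weight in W1 is free
```

This only captures local, first-order freedom; the stat mech version would ask about the whole set of parameter settings compatible with the outputs on each datapoint, not just gradient zeros.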
Some degrees of freedom will be untouched because they're baked into the architecture, like the scale freedom of ReLU functions. But those are a small minority and also not useful for identifying the structure of the learned algorithms. Precisely because they are guaranteed to stay free no matter what algorithms are learned, they cannot contain any information about them.