This brings up another important point: a lot of externalities are impossible to calculate, and such approaches therefore end up fixating on the part that seems calculable without accounting for (or even noticing) the incalculable part. If the calculable externalities happen to be opposed to larger incalculable externalities, then you can end up worse off than if you had never tried.
I think this is correct as a conditional statement, but I don't think one can deduce the unconditional implication that attempting to price some externali...
In general, I don't agree with arguments of the form "it's difficult to quantify the externalities, so we shouldn't quantify anything and should just ignore all external effects", modulo concerns about public choice ("what if the policy pursued is not what you would recommend but some worse alternative?"), which are real and serious but outside the scope of my argument. There's no reason a priori to suppose that any positive or negative effects not currently priced will be of the same order of magnitude.
If you think there are benefits to having a population where mos...
If the risk is sufficiently high, then the shops would simply not sell guns to anyone who seemed like they might let their guns be stolen, for example. Note that the shops would still be held liable for any harm that occurs as a result of any gun they have sold, irrespective of whether the buyer was also the perpetrator of the harm.
In practice, the risk of a gun sold to a person with a safe background being used in such an act is probably not that large, so such a measure doesn't need to be taken: the shop can just sell the guns at a somewhat inflated pric...
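To spell out the pricing logic, here is a rough sketch of how a shop could fold the expected liability into the sale price (my own formalization; the terms are illustrative, not from the original discussion):

$$\text{price} \approx \text{base cost} + \Pr(\text{harm}) \cdot \mathbb{E}[\text{damages} \mid \text{harm}]$$

A buyer profile with a low estimated risk of harm only inflates the price slightly, while a sufficiently risky profile effectively prices the sale out of existence.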
If the risk is sufficiently high, then the shops would simply not sell guns to anyone who seemed like they might let their guns be stolen,
You do realize it is illegal to discriminate against customers on the basis of things like race, income, where they live, etc, right?
So, step 1 in this plan has to begin with "dismantle the last 60 years of civil rights legislation".
It is not a tautology.
Can you explain to me the empirical content of the claim, then? I don't understand what it's supposed to mean.
About the rest of your comment, I'm confused about why you're discussing what happens when both chess engines and humans have a lot of time to do something. For example, what's the point of this statement?
My understanding is that it's not true that computers running for a long time would beat humans also running for a long time, and that historically it's been quite the opposite...
I don't understand ho...
Sure, but in that case I would not say the AI thinks faster than humans, I would say the AI is faster than humans at a specific range of tasks where the AI can do those tasks in a "reasonable" amount of time.
As I've said elsewhere, there is a quality or breadth vs serial speed tradeoff in ML systems: a system that only does one narrow and simple task can do that task at a high serial speed, but as you make systems more general and get them to handle more complex tasks, serial speed tends to fall. The same logic that people are using to claim GPT-4 thinks f...
True, but isn't this almost exactly analogously true for neuron firing speeds? The corresponding period for neurons (10 ms - 1 s) does not generally correspond to the timescale of any useful cognitive work or computation done by the brain.
Yes, which is why you should not be using that metric in the first place.
...But even the top-line number is (at least theoretically) a very concrete measure of something that you can actually get out of the system. In contrast, when used in "computational equivalence" estimates of the brain, FLOP/s are (somewhat dubious
Don't global clock speeds have to go down as die area goes up due to the speed of light constraint?
For instance, if you made a die with 1e15 MAC units and the area scaled linearly, you would be looking at a die that's ~ 2e9 times larger than H100's die size, which is about 1000 mm^2. The physical dimensions of such a die would be around 2 km^2, so the speed of light would limit global clock frequencies to something on the order of c/(1 km) ~= 300 kHz, which is not 1 million times faster than the 1 kHz you attribute to the human brain. If you need multiple ...
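To make the arithmetic explicit, here's a rough sketch of the calculation using the approximate numbers above (a minimal Python sketch, not a careful model of real die layouts):

```python
# Rough sketch of the die-area / speed-of-light arithmetic above.
H100_DIE_AREA_MM2 = 1_000   # ~1000 mm^2, as quoted above
SCALE_FACTOR = 2e9          # ~2e9x the H100's die area for 1e15 MAC units
C = 3e8                     # speed of light in m/s

die_area_m2 = H100_DIE_AREA_MM2 * 1e-6 * SCALE_FACTOR  # ~2e6 m^2 = 2 km^2
die_side_m = die_area_m2 ** 0.5                         # ~1.4 km on a side
max_global_clock_hz = C / die_side_m                    # a signal must cross the die each cycle

print(f"die area ~ {die_area_m2 / 1e6:.1f} km^2")
print(f"light-limited global clock ~ {max_global_clock_hz / 1e3:.0f} kHz")
```

Using c/(1 km) as above gives ~300 kHz; using the full ~1.4 km side length gives ~200 kHz. Either way the conclusion is the same: nowhere near a millionfold speedup over a ~1 kHz brain.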
Don't global clock speeds have to go down as die area goes up due to the speed of light constraint?
Yes if you use 1 die with 1 clock domain, they would. Modern chips don't.
...For instance, if you made a die with 1e15 MAC units and the area scaled linearly, you would be looking at a die that's ~ 2e9 times larger than H100's die size, which is about 1000 mm^2. The physical dimensions of such a die would be around 2 km^2, so the speed of light would limit global clock frequencies to something on the order of c/(1 km) ~= 300 kHz, which is not 1 million time
I think counterexamples are easy to find. For example, chess engines in 1997 could play at the level of top human chess players on consumer hardware, but only if they were given orders of magnitude more time to think than the top humans had available. Around 1997 Deep Blue was of a similar strength to Kasparov, but it had to run on a supercomputer; on commercial hardware chess engines were still only 2400-2500 elo. If you ran them for long enough, though, they would obviously be stronger than even Deep Blue was.
I think the claim that "in every case where w...
If there are people who say “current AIs think many orders of magnitude faster than humans”, then I agree that those people are saying something kinda confused and incoherent, and I am happy that you are correcting them.
Eliezer himself has said (e.g. in his 2010 debate with Robin Hanson) that one of the big reasons he thinks CPUs can beat brains is that CPUs run at 1 GHz while brains run at 1-100 Hz, and that the only barrier is that the CPUs are currently running "spreadsheet algorithms" rather than the algorithm used by the human brain. I can find the exact...
As far as I know, in every case where we've successfully gotten AI to do a task at all, AI has done that task far far faster than humans. When we had computers that could do arithmetic but nothing else, they were still much faster at arithmetic than humans. Whatever your view on the quality of recent AI-generated text or art, it's clear that AI is producing it much much faster than human writers or artists can produce text/art.
"Far far faster" is an exaggeration that conflates vastly different orders of magnitude with each other. When compared against...
I thought cryonics was unlikely to work because a bunch of information might be lost even at the temperatures that bodies are usually preserved in. I now think this effect is most likely not serious and cryonics can work in principle at the temperatures we use, but present-day cryonics is still unlikely to work because of how much tissue damage the initial process of freezing can do.
As I said, I think it's not just that the language is poetic. There is an implicit inference that goes like
I don't think it's a matter of poetic license. You're making an empirical claim that if specialists actually believed kidney donation had no long-term side effects, they would be lining up to donate their kidneys and we would see a much higher rate of kidney donations in the US. I think this claim is wrong because the inconvenience of surgery is substantial enough to block people from donating their kidneys even in the absence of long-term side effects.
The use of the word "evil" sneaks in an assumption that most people would be happy to make this tradeoff ...
...Have most of the researchers looking at kidney donation donated a kidney? Have most nephrology researchers donated a kidney? Most surgeons doing kidney transplants? Obviously not, otherwise we’d have more than 200 donations to strangers each year in the US. There are 10,000 board-certified nephrologists, and a few hundred more are added each year; if they took this data seriously, they’d all donate.
Heck, on top of those you can add nephrology researchers, the medical statisticians who happen to focus on kidney disease, transplant surgeons, and all well-info
...There is more data, and better data, e.g. data gathered in double-blinded RCTs, that shows things like:
- Homeopathy works very well for a variety of conditions, sometimes better than real drugs used to treat them.
- Increasing the healthcare budget and the amount of healthcare people receive, both in rich countries (e.g. the USA) and poor ones (e.g. India), has no effect on mortality.
I can make both of these claims based on many individual RCTs, as well as based on the aggregation of all existing RCTs.
I’m not saying that these claims make sense, they don’t, there are
I don't think those ratings are comparable. On the other hand, my estimate of 3d was apparently lowballing it based on some older policy networks, and newer ones are perhaps as strong as 4d to 6d, which on the upper end is still weaker than professional players but not by much.
However, there is a big gap between weak professional players and "grandmaster level", and I don't think the raw policy network of AlphaGo could play competitively against a grandmaster level Go player.
This is not quite true. Raw policy networks of AlphaGo-like models are often at a level around 3 dan in amateur rankings, which would qualify as a good amateur player but nowhere near the equivalent of grandmaster level. If you match percentiles in the rating distributions, 3d in Go is perhaps about as strong as an 1800 elo player in chess, while "master level" is at least 2200 elo and "grandmaster level" starts at 2500 elo.
Edit: Seems like policy networks have improved since I last checked these rankings, and the biggest networks currently available for p...
I think you're ignoring the qualifier "literally portrayed" in Matthew's sentence, and neglecting the prior context that he's talking about AI development being something mainly driven forward by hobbyists with no outsized impacts.
He's talking about more than just the time in which AI goes from e.g. doubling the AI software R&D output of humans to some kind of singularity. The specific details Eliezer has given about this scenario have not been borne out: for example, in his 2010 debate with Robin Hanson, he emphasized a scenario in which a few people ...
Hmm, I do agree the foom debates talk a bunch about a "box in a basement team", but the conversation was pretty explicitly not about the competitive landscape and how many people are working on this box in a basement, etc. It was about whether it would be possible for a box in a basement with the right algorithms to become superhuman in a short period of time. In particular, Eliezer says:
...In other words, I’m trying to separate out the question of “How dumb is this thing (points to head); how much smarter can you build an agent; if that agent were telep
I assume John was referring to Unitary Evolution Recurrent Neural Networks which is cited in the "Orthogonal Deep Neural Nets" paper.
It might be right, I don't know. I'm just making a local counterargument without commenting on whether the 2.5 PB figure is right or not, hence the lack of endorsement. I don't think we know enough about the brain to endorse any specific figure, though 2.5 PB could perhaps fall within some plausible range.
While I wouldn't endorse the 2.5 PB figure itself, I would caution against this line of argument. It's possible for your brain to contain plenty of information that is not accessible to your memory. Indeed, we know of plenty of such cognitive systems in the brain whose algorithms are both sophisticated and inaccessible to any kind of introspection: locomotion and vision are two obvious examples.
I downvoted this comment for its overconfidence.
First of all, the population numbers are complete garbage. This is completely circular. You are just reading out the beliefs about history used to fabricate them. The numbers are generated by people caring about the fall of Rome. The fall of Rome didn't cause a decline in China. Westerners caring about the fall of Rome caused the apparent decline in China.
I will freely admit that I don't know how population numbers are estimated in every case, but your analysis of the issue is highly simplistic. Estimate...
Well, that's true, but at some level, what else could it possibly be? What other cause could be behind the long-run expansion in the first place, so many millennia after humans spanned every continent but Antarctica?
Technological progress being responsible for the long-run trend doesn't mean you can attribute local reversals to humans hitting limits to technological progress. Just as a silly example, the emergence of a new strain of plague could have led to the depopulation of urban centers, which lowers R&D efficiency because you lose concentration...
McEvedy and Jones actually discuss a regional breakdown in the final section of the book, but they speculate too much for the discussion to be useful, I think. They attribute any substantial slowdown in growth rates to population running up against technological limits, which seems like a just-so story that could explain anything.
They note that the 3rd century AD appears to have been a critical time, as it's when population growth trends reversed in both Europe and China at around the same time: in Europe with the Crisis of the Third Century, and in China ...
I've actually written about this subject before, and I agree that the first plague pandemic could have been significant: perhaps killing around 8% of the global population in the four years from 541 to 544. However, it's also worth noting that our evidence for this decline is rather scant; we know that the death toll was very high in Constantinople but not much about what happened outside the capital, mostly because nobody was there to write it down. So it's also entirely conceivable that the death toll was much lower than this. The controversy about this ...
In the west, I think the fall of the Western Roman Empire was probably a significant hit, and caused a major setback in economic growth in Europe.
Attribution of causality is tricky with this event, but I would agree if you said the fall coincided with a major slowdown in European economic growth.
China had its bloody Three Kingdoms period, and later the An Lushan rebellion.
I think a problem re: China is that a lot of population decline estimates for China are based on the official census, and as far as I know China didn't have a formal census before t...
If people vote as if their individual vote determines the vote of a non-negligible fraction $f$ of the voter pool, then you only need an average altruism level of $\sim 1/(fN)$ (averaged over the whole population), so the required value of the entire population is $\sim 1/f$ instead of $\sim N$, which seems much more realistic.
So voting blue can make sense for a sufficiently large coalition of "ordinary altruists" who are able to pre-commit to their vote and think people outside the coalition might vote blue by mistake etc., rather than the "extraordinary altruists" we need in the original situatio...
That would be questioning the assumption that your cost function as an altruist should be linear in the number of lives lost. I'm not sure why you would question this assumption, though; it seems rather unnatural to make this a concave function, which is what you would need for your logic to work.
I'm not quite sure what you mean by that.
Unless I expect the pool of responders to be 100% rational and to choose red, I should expect some to choose blue. Since I (and presumably other responders) do expect some to choose blue, that makes >50% blue the preferred outcome. Universal red is just not a realistic outcome.
Whether or not I choose blue then depends on factors like how I value the lives of others compared to mine, the number of responders, etc - as in the equations in your post.
Empirically, as GeneSmith points out, something is wrong with Wal...
I'm surprised by how much this post is getting upvoted. It gives us essentially zero information about any question of importance, for reasons that have already been properly explained by other commenters:
Chess is not like the real world in important respects. What the threshold is for material advantage such that a 1200 elo player could beat Stockfish at chess tells us basically nothing about what the threshold is for humans, either individually or collectively, to beat an AGI in some real-world confrontation. This point is so trivial that I feel somew
(I'm the main KataGo dev/researcher)
Just some notes about KataGo - the degree to which KataGo has been trained to play well vs weaker players is relatively minor. The only notable thing KataGo does is that in some self-play games it gives one side up to an 8x advantage in playouts over the other side, where each side knows this. (KataGo also initializes some games with handicap stones to make them in-distribution, and/or adjusts komi to make the game fair.) So the strong side learns to prefer positions that elicit a higher chance of mistakes by the ...
If someone could try to convince me that this experiment was not pointless and actually worth running for some reason, I would be interested to hear their arguments. Note that I'm more sympathetic to "this kind of experiment could be valuable if ran in the right environment", and my skepticism is specifically about running it for chess.
I've been interested in the study of this question for a while. I agree this post has the flaws you point out, but I still find that it provides interesting evidence. If the result had been that Stockfish would have continue...
Are neural networks trained using reinforcement learning from human feedback in a sufficiently complex environment biased towards learning the human simulator or the direct translator, in the sense of the ELK report?
I think there are arguments in both directions and it's not obvious which solution a neural network would prefer if trained in a sufficiently complex environment. I also think the question is central to how difficult we should expect aligning powerful systems trained in the current paradigm to be.
I'm curious if these rate limits were introduced as a consequence of some recent developments. Has the website been having more problems with spam and low-quality content lately, or has the marginal benefit of making these changes gone up in some other way?
It could also be that you had this idea only recently and in retrospect it had been a good idea for a long time, of course.
I first thought about them ~8 months ago, simply because we spend ~20 minutes a day reviewing content from new or downvoted users, and it's a combination of "adds up to a lot of time" and "also kind of emotionally exhausting to think about exactly where the line is where we should take some kind of action."
The idea of auto-rate-limits felt a lot more salient during the April spike, where a lot of people showed up due to the Eliezer TIME article and other "AI in the news" things. That has since calmed down, but I think we'll get more things like that ...
Yes, in practice having a model of what is actually driving the correlations can help you do better than these estimates. A causal model would be helpful for that.
The product estimate for the expected correlation is only useful in a setting where nothing else is known about the relationship between the three variables than the two correlations, but in practice you often have some beliefs about what drives the correlations you observe, and if you're a good Bayesian you should of course also condition on all of that.
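As a minimal sketch of the setting where the product estimate is exactly right (my own toy example, assuming the estimate under discussion is corr(X, Z) ≈ corr(X, Y) · corr(Y, Z)): in a jointly Gaussian chain where X and Z are conditionally independent given Y, the product is exact, and any extra knowledge about the causal structure is what moves you away from it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Gaussian chain X -> Y -> Z: X and Z are conditionally independent given Y,
# so corr(X, Z) equals the product corr(X, Y) * corr(Y, Z).
rho_xy, rho_yz = 0.6, 0.7
x = rng.normal(size=n)
y = rho_xy * x + np.sqrt(1 - rho_xy**2) * rng.normal(size=n)
z = rho_yz * y + np.sqrt(1 - rho_yz**2) * rng.normal(size=n)

print(np.corrcoef(x, z)[0, 1])  # ~ 0.42
print(rho_xy * rho_yz)          # 0.42
```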
As an aside, I've tried to work out what the optimal learning rate for a large language model should be based on the theory in the post, and if I'm doing the calculations correctly (which is a pretty big if) it doesn't match actual practice very well, suggesting there is actually something important missing from this picture.
Essentially, the coefficient should be $\approx 2/(\eta \sigma^2)$, where $\sigma^2$ is the variance of the per-parameter noise in SGD. If you have a learning rate $\eta$, you scale the objective you're optimizing by a factor of $\eta$ and the noise variance by a factor of $\eta^2$. Lik...
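For reference, here is the rough correspondence I have in mind, written out under assumed notation (learning rate $\eta$, isotropic per-parameter noise variance $\sigma^2$); this is a sketch, not a careful treatment of heteroskedastic or non-Gaussian noise:

$$\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t) + \eta\, \xi_t, \qquad \xi_t \sim \mathcal{N}(0, \sigma^2 I),$$

which is the Euler discretization (with step $dt = \eta$) of the overdamped Langevin equation $d\theta = -\nabla L(\theta)\, dt + \sqrt{2\beta^{-1}}\, dW_t$ provided $2\beta^{-1}\, dt = \eta^2 \sigma^2$, i.e. $\beta = 2/(\eta\sigma^2)$. The corresponding stationary distribution is $\propto \exp(-\beta L(\theta)) = \exp\!\big(-2 L(\theta)/(\eta\sigma^2)\big)$.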
That's useful to know, thanks. Is anything else known about the properties of the noise covariance beyond "it's not constant"?
Some comments on the paper itself: if the problem is that SGD with homoskedastic Gaussian noise fails to converge to a stationary distribution, why don't they define SGD over a torus instead? Seems like it would fix the problem they are talking about, and if it doesn't change the behavior it means their explanation of what's going on is incorrect.
If the only problem is that with homoskedastic Gaussian noise convergence to a stationa...
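To illustrate what I mean by defining SGD over a torus, here's a minimal toy sketch (my own construction, 1-D, with a made-up periodic loss): wrapping the parameter modulo $2\pi$ keeps the homoskedastic-noise dynamics from escaping to infinity, and the empirical distribution can then be compared against the $\exp(-2L/(\eta\sigma^2))$ prediction.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_loss(theta):
    # toy periodic loss L(theta) = 1 - cos(theta); its gradient is sin(theta)
    return np.sin(theta)

eta, sigma = 0.05, 0.5   # learning rate and per-parameter noise std
n_steps = 200_000

theta, samples = 0.0, []
for t in range(n_steps):
    theta = theta - eta * grad_loss(theta) + eta * sigma * rng.normal()
    theta = (theta + np.pi) % (2 * np.pi) - np.pi   # wrap to the torus [-pi, pi)
    if t > n_steps // 10:                           # discard burn-in
        samples.append(theta)

# Compare the empirical histogram to the predicted stationary density,
# proportional to exp(-2 * L(theta) / (eta * sigma^2)).
hist, edges = np.histogram(samples, bins=50, range=(-np.pi, np.pi), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
target = np.exp(-2 * (1 - np.cos(centers)) / (eta * sigma**2))
target /= target.sum() * (edges[1] - edges[0])
print(float(np.max(np.abs(hist - target))))  # small if the Langevin picture holds here
```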
Check the Wikipedia section for the stationary distribution of the overdamped Langevin equation.
I should probably clarify that it's difficult to have a rigorous derivation of this claim in the context of SGD in particular, because it's difficult to show absence of heteroskedasticity in SGD residuals. Still, I believe that this is probably negligible in practice, and in principle this is something that can be tested by experiment.
Sure, I agree that I didn't put this information into the post. However, why do you need to know which is more likely to know anything about e.g. how neural networks generalize?
I understand that SLT has some additional content beyond what is in the post, and I've tried to explain how you could make that fit in this framework. I just don't understand why that additional content is relevant, which is why I left it out.
As an additional note, I wasn't really talking about floating point precision being the important variable here. I'm just saying that if you...
You need to discretize the function before taking preimages. If you just take preimages in the continuous setting, of course you're not going to see any of the interesting behavior SLT is capturing.
In your case, let's say that we discretize the function space by choosing which one of the functions you're closest to, up to some tolerance $\varepsilon$. In addition, we also discretize the codomain of the functions by looking at the lattice $\delta \mathbb{Z}$ for some $\delta$. Now, you'll notice that there's a disk of some radius around the origin which contains only functions mapping to th...
I'm not too sure how to respond to this comment because it seems like you're not understanding what I'm trying to say.
I agree there's some terminology mismatch, but this is inevitable because SLT is a continuous model and my model is discrete. If you want to translate between them, you need to imagine discretizing SLT, which means you discretize both the codomain of the neural network and the space of functions you're trying to represent in some suitable way. If you do this, then you'll notice that the worse a singularity is, the lower the complexity of ...
I don't think this representation of the theory in my post is correct. The effective dimension of the singularity near the origin is much higher, e.g. because near every other minimal point of this loss function the Hessian doesn't vanish, while for the singularity at the origin it does vanish. If you discretized this setup by looking at it with a lattice of mesh $\varepsilon$, say, you would notice that the origin is surrounded by many parameters that give nearly identical loss, while near other parts of the space the number of such parameters is far fewer.
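As a toy numerical illustration of the counting claim (my own example; I'm using $L(x, y) = x^2 y^2$, which has the qualitative features described above but is not necessarily the exact function under discussion): count mesh-$\varepsilon$ lattice points with near-minimal loss inside a small disk around the most singular point versus around an ordinary point of the minimum set.

```python
import numpy as np

def loss(x, y):
    # Toy loss with minimum set {xy = 0}; the origin is the most singular point
    # (the Hessian vanishes there but not at other minima such as (1, 0)).
    return x**2 * y**2

def count_low_loss_points(center, mesh=1e-3, radius=0.1, cutoff=1e-8):
    """Count lattice points of spacing `mesh` within `radius` of `center`
    whose loss is below `cutoff`."""
    cx, cy = center
    n = int(radius / mesh)
    xs = cx + mesh * np.arange(-n, n + 1)
    ys = cy + mesh * np.arange(-n, n + 1)
    X, Y = np.meshgrid(xs, ys)
    in_disk = (X - cx) ** 2 + (Y - cy) ** 2 <= radius**2
    return int(np.sum((loss(X, Y) < cutoff) & in_disk))

print(count_low_loss_points(center=(0.0, 0.0)))  # many near-minimal-loss points
print(count_low_loss_points(center=(1.0, 0.0)))  # far fewer, despite also being a minimum
```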
The reason...
Can you give an example of a loss function which has the mode of singularity you're talking about? I don't think I'm quite following what you're talking about here.
In SLT the loss function is assumed to be analytic, so I don't understand how the Hessian can fail to be well-defined anywhere. It's possible that the Hessian vanishes at some point, suggesting that the singularity there is even worse than quadratic, e.g. at the origin or something like that. But even in this regime essentially the same logic is going to apply - the worse the singularity, the further away you can move ...
Say that you have a loss function $L: \mathbb{R}^d \to \mathbb{R}$. The minimum loss set is probably not exactly $\{\nabla L = 0\}$, but it has something to do with that, so let's pretend that it's exactly that for now.
This is a collection of $d$ equations that are generically independent and so should define a subset of dimension zero, i.e. a collection of points in $\mathbb{R}^d$. However, there might be points at which the partial derivatives vanishing don't define independent equations, so we get something of positive dimension.
In these cases, what happens is that the gradient itself has vanishing de...
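A toy example of this degeneracy (my own, not necessarily the one from the discussion above): for

$$L(x, y) = x^2 y^2, \qquad \nabla L = (2xy^2,\; 2x^2 y),$$

the two partial derivatives both vanish on the union of the coordinate axes, so instead of isolated critical points we get a one-dimensional critical set; and at the origin, where the two axes meet, the Hessian vanishes as well, so the singularity there is worse than at the other minima.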
There are two arguments frequently offered for a free market economy over a centrally planned economy: an argument based around knowledge, sometimes called the socialist calculation problem; and another argument based on incentives. The arguments can be briefly summarized like so:
A point I've not seen anyone else make is that the argument from ...