Comments

gwern · 22

> We have not yet tried 4.5 as it's so expensive that we would not be able to deploy it, even for limited sections.

Still seems like potentially valuable information to know: how much does small-model smell cost you? What happens if you ablate reasoning? If it is factual knowledge and GPT-4.5 performs much better, then that tells you things like 'maybe finetuning is more useful than we think', etc. If you are already set up to benchmark all these OA models, then a datapoint from GPT-4.5 should be quite easy, and a small amount of chump change, like a few hundred bucks, in comparison to the insight.
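To gut-check the 'few hundred bucks' arithmetic, a back-of-the-envelope in Python; every number below (prices, benchmark size, token counts) is an assumption for illustration, not actual GPT-4.5 pricing:

```python
# Rough cost of one GPT-4.5 benchmark run; every number here is an
# assumption for illustration -- substitute current API pricing.
price_in, price_out = 75.0, 150.0   # $ per million tokens (assumed)
n_questions = 5_000                 # benchmark size (assumed)
tok_in, tok_out = 1_000, 500        # tokens per question (assumed)

cost = n_questions * (tok_in * price_in + tok_out * price_out) / 1e6
print(f"${cost:,.0f}")              # -> $750 under these assumptions
```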

gwern · 127

> 22.3 percent of cardinals are reported as eldest children. That compares to 21.6 percent who are youngest children. Eldest children are still favored.

I kept waiting for you to discuss this point: in analyzing cardinals (as opposed to ordinary random people), what about the other relevant birth-order effects? Like the... first-born birth-order effect, where first-borns are smarter, more extraverted, stabler, higher-SES, etc. All of which sounds exactly like the sort of thing you need to rise through an extreme hierarchy to the top.

After all, surely homosexuality is not the primary trait the Catholic Church hierarchy is trying to select for?

gwern* · 50

> The failure of the compute-rich Llama models to compete with the compute-poorer but talent- and drive-rich Alibaba and DeepSeek

This seems like it's exaggerating the Llama failure. Maybe the small Llama-4s just released yesterday are a bit of a disappointment because they don't convincingly beat all their rivals; but how big a gap is that in absolute terms? When it comes to DL models, there's generally little reason to use #2; but that doesn't mean #2 was all that much worse or 'a failure': it might only have been weeks behind #1. (Indeed, a model might've been the best when it was trained, and release just took a while. Would it be reasonable to call such a model a 'failure'? I wouldn't. It might be a failure of a business model or a corporate strategy, but that model qua model is a good model, Brent.) #2 just means it's #2, lesser by any amount. How far back would we have to go for the small Llama-4s to have been on the Pareto frontier? It's still early, but I'm getting the impression so far that you wouldn't have to go that far back. Certainly not 'years' (it couldn't perform that well on LMArena in its 'special chatbot configuration', even sloptimized, if it was years behind), unless the wilder rumors turn out to be true (like deliberately training on the test sets, in which case Zuckerberg may have to burn FB AI with fire and reboot the entire AI org because the culture is irretrievably rotten; but of course such rumors usually do not, so I mention this mostly to indicate that right now Llama Internet commentary is high on heat and low on light).

> The failure of the compute-rich Llama models to compete with the compute-poorer but talent- and drive-rich Alibaba and DeepSeek shows that even a substantial compute lead can be squandered. Given that there is a lot of room for algorithmic improvements (as proven by the efficiency of the human brain), this means that determined engineering plus a willingness to experiment, rather than doubling down on currently working tech, can overcome a compute disadvantage.

I'm not really following your argument here. Even if LLaMA-4 is disappointing compared to what DeepSeek could've done with the same compute because they'd get 40% MFU instead of FB's 20% or whatever, and are 2x as good in effective-compute, that doesn't close the lead when FB finishes its new Manhattan-sized datacenter, say, and has 100x DS's compute. Or are you arguing for the possibility of someone making an asymptotic scaling law breakthrough with a better exponent, so that even with 1/100th the compute, they can beat one of the giants?
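For concreteness, here is the stylized arithmetic behind that last possibility; the power-law form and the numbers are illustrative assumptions (holding the prefactor $a$ fixed), not fitted values. If loss scales as $L(C) = aC^{-\alpha}$, a challenger with exponent $\alpha'$ but a 100x compute deficit matches the incumbent when

$$a\left(\frac{C}{100}\right)^{-\alpha'} \le a\,C^{-\alpha} \iff \frac{\alpha'}{\alpha} \ge \frac{\ln C}{\ln C - \ln 100}$$

At a frontier scale of $C \approx 10^{26}$ FLOP, the right-hand side is roughly $59.9 / 55.3 \approx 1.08$: an exponent merely ~8% better offsets a 100x compute deficit. That is what makes an asymptotic breakthrough qualitatively different from a constant-factor gain like doubled MFU, which can never close a 100x gap.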

gwern · 62

The human microbiome is irrelevant to this topic. The microbiome is highly heritable (the usual twin-study & SNP heritabilities), is caused by genes and the environment rather than being a cause itself, and is unstable besides; its direct causal effects in normal humans are minimal. We know that it is supremely irrelevant because environmental changes like antibiotics, new foods, or global travel do not produce large changes in intelligence, and most dramatically, germ-free humans exist and are of normal or even above-average intelligence, eg the fascinating mistakes and delusions of David Vetter (the 'bubble boy') despite his high intelligence. (Amusingly, germ-free mice apparently even live longer.) Microbiome research is, in general, very low quality and can't be taken seriously; look at your link:

> Examples of how important the gut microbiome, and the parents' health, are for human development: https://humanmicrobiome.info/maternity/

Most of this page is meaningless mouse studies (infamous for not replicating, for getting whatever result the experimenter wants, and for the huge systemic biases of the animal-model literature), and the handful of actual human studies I see here are all garbage: things like cross-sectional studies with large known familial confounding, or heavy reliance on things like breastfeeding, where the beneficial effects disappear when controlling for just some of the confounds. This also goes for much-touted correlations like the autism ones. There is not a single result on this page that provides a shred of evidence for your implied thesis that microbiome interventions could, even in theory, possibly matter to 'how to make superbabies'. Not one.
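To illustrate the familial-confounding point, a toy simulation (made-up effect sizes, purely illustrative): an exposure with zero causal effect still correlates with the outcome cross-sectionally because both are driven by a shared family factor, and the correlation vanishes in a sibling comparison.

```python
# Toy simulation of familial confounding: the "exposure" has NO causal
# effect on the outcome here, yet correlates with it cross-sectionally.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
family = rng.normal(size=n)                             # shared familial factor (eg SES)
exposed = (family + rng.normal(size=n) > 0).astype(float)
outcome = 100 + 5 * family + 10 * rng.normal(size=n)    # caused by family only

print(np.corrcoef(exposed, outcome)[0, 1])              # ~0.25: a 'benefit' appears

# Sibling comparison: a second child shares the same familial factor,
# so differencing siblings removes the confound...
exposed2 = (family + rng.normal(size=n) > 0).astype(float)
outcome2 = 100 + 5 * family + 10 * rng.normal(size=n)
print(np.corrcoef(exposed - exposed2, outcome - outcome2)[0, 1])  # ~0: the 'benefit' vanishes
```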

> In 2021 a geneticist insisted to me that the microbiome was just a fad.

He was right. BTW, you remember what happened in 2021, right?

gwern · 368

> I don't really understand how a local copy of the weights gives the terrorists more practical control over the software's alignment. I don't think it's easy to manually tweak weights for so specific a purpose. Maybe they just mean the API is doing a good job of blocking sketchy requests?

You can finetune models for any specific purpose: just provide a few datapoints and train. The more specific the purpose, the easier tweaking the weights is, not harder. (Surely, if nothing else, you've seen all of the LoRAs and other finetunes of image-generation models to generate a specific character?) There is an extensive literature at this point on how trivial it is to strip away all of the friendly-chatbot persona from released checkpoints, such as LLaMA, if you are able to access and modify the model's slow weights and fast weights directly.
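As a concrete sketch of how low the bar is, here is the standard parameter-efficient finetuning pattern (Hugging Face transformers + peft; the model name and hyperparameters are illustrative assumptions, not a recipe):

```python
# Sketch: parameter-efficient finetuning of a released checkpoint.
# Model name and hyperparameters are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-2-7b-hf"      # stands in for any open-weights model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# LoRA wraps the frozen base model in small trainable low-rank adapters,
# so "tweaking the weights" means training ~0.1% of the parameters.
config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"],
                    task_type="CAUSAL_LM")
model = get_peft_model(model, config)
model.print_trainable_parameters()

# From here, training on even a handful of (prompt, completion) pairs with a
# standard causal-LM loss is enough to shift narrow behaviors; the narrower
# the target, the fewer examples needed.
```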

gwern* · 72

Yes, in dire straits. But it's usually called 'hyperinflation' when you try to make seigniorage equivalent to >10% of GDP and fund the government through deliberately creating high inflation (on top of any regular inflation, of course). And because inflation is in considerable part about expectations, you can't simply stop it either. Not to mention everything else that happens once you start a hyperinflation.
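The back-of-the-envelope, under the stylized assumption that real money demand stays a fixed share of GDP (in reality it collapses as inflation rises, which only makes things worse):

$$\text{seigniorage per year} = \frac{\Delta M}{P} \approx \pi \cdot \frac{M}{P}$$

With a real monetary base $M/P$ on the order of 10% of GDP, raising seigniorage worth 10% of GDP requires money growth (and thus, in steady state, inflation) of $\pi \approx 100\%$ per year; and since high inflation shrinks $M/P$ as people flee the currency (Cagan's hyperinflation dynamics), the required $\pi$ ratchets upward from there.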

(FWIW, this is a perfectly reasonable question to ask a LLM first. eg Gemini-2.5-pro will give you a thorough and sensible answer as to why this would be extraordinarily destructive and distortionary, and far worse than the estimated burden of tax return filing, and it would likely satisfy your curiosity on this thought-experiment with a much higher quality answer than anyone on LW2, including me, is ever likely to provide.)

gwern · 42

> This model seems to contradict https://www.lesswrong.com/posts/pdaGN6pQyQarFHXF4/reward-is-not-the-optimization-target because it has, in fact, developed reward as the optimization target without ever being instructed to maximize reward.

It doesn't contradict TurnTrout's post, because his claims are about an irrelevant class of RL algorithms (model-free policy gradients). A model-based RL agent (like a human, or an LLM like Claude pretrained to imitate model-based RL agents across a huge number of settings, ie. human text data) optimizes the reward, if it's smart and knowledgeable enough to do so.

(This comment is another example of how TurnTrout's post was a misfire, because everyone takes away the opposite of what they should have.)
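To make the distinction concrete, a toy bandit in Python (purely illustrative): in the model-free policy-gradient regime, reward only reinforces past actions; a model-based agent represents reward and optimizes it directly.

```python
# Toy contrast (illustrative only) between the two classes of algorithm at issue.
import numpy as np

rng = np.random.default_rng(0)
true_reward = np.array([0.1, 0.5, 0.9])       # 3-armed bandit

# 1. Model-free policy gradient (REINFORCE): reward is a reinforcement
#    signal that upweights whatever was just done; the policy never
#    represents or consults "reward" as an object.
logits = np.zeros(3)
for _ in range(2000):
    p = np.exp(logits) / np.exp(logits).sum()
    a = rng.choice(3, p=p)
    r = true_reward[a] + rng.normal(0, 0.1)
    grad = -p
    grad[a] += 1.0                            # d log pi(a) / d logits
    logits += 0.1 * r * grad

# 2. Model-based agent: learns a reward model and optimizes it directly;
#    here, reward *is* the optimization target.
reward_model = true_reward + rng.normal(0, 0.01, 3)   # learned estimate
plan = int(np.argmax(reward_model))
```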

gwern* · 60

> The intuition behind this approach draws from our understanding of selection in biological systems. Consider how medieval Europe dealt with violence:

This is a bad example, because first, your description is incorrect (Clark nowhere suggests this in A Farewell to Alms, as I just double-checked, because his thesis is about selecting for high-SES traits, not selecting against violence, and in England, not Europe; so I infer you are actually thinking of the Frost & Harpending thesis, which is about Western Europe, and primarily post-medieval England at that); second, the Frost & Harpending truncation-selection hypothesis has little evidence for it and can hardly be blandly referred to, as if butter wouldn't melt in your mouth, as obviously 'how medieval Europe dealt with violence' (I don't particularly think it's true myself, just a cute idea about truncation selection, nor is it obvious whether it can account for a majority, much less all, of the secular decline in violence); and third, it is a weird, opaque, obscure example that both fails to illustrate the principle well and is maximally inflammatory.

gwern* · 50

Experimentation is valuable for its high VoI, but it seems hard to encourage 'in general', because experimenting on anything is painful and difficult, and the more so the more important and valuable it is. So just 'subsidizing experiments' would be like 'subsidizing fixing bugs in source code'.

What would you do if you were a funder who wanted to avoid this? Well, you'd... fund specific experiments you knew were important and of high value. Which is what the federal government and many other NGOs or philanthropists do.

gwern · 110

> ...The loss of knowledge has been attributed to several factors. Firstly, Lind showed in his work that there was no connection between the acidity of the citrus fruit and its effectiveness at curing scurvy. In particular, he noted that acids alone (sulphuric acid or vinegar) would not suffice. Despite this, it remained a popular theory that any acid could be used in place of citrus fruit. This misconception had significant consequences.
>
> When the Royal Navy changed from using Sicilian lemons to West Indian limes, cases of scurvy reappeared. The limes were thought to be more acidic, and it was therefore assumed that they would be more effective at treating scurvy. However, limes actually contain much less vitamin C and were consequently much less effective. Furthermore, fresh fruit was substituted with lime juice that had often been exposed to either air or copper piping. This resulted in at least a partial removal of vitamin C from the juice, thus reducing its effectiveness.
>
> The discovery that fresh meat was able to cure scurvy was another reason why people no longer treated the condition with fresh fruit. This discovery led to the belief that perhaps scurvy was not caused by a dietary problem at all. Instead, it was thought to be the result of a bacterial infection from tainted meat. In fact, the healing properties of fresh meat come from the high levels of vitamin C it contains.
>
> Finally, the arrival of steam shipping substantially reduced the amount of time people spent at sea, and therefore the difficulties in carrying enough fresh produce were reduced. This decreased the risk of scurvy, so that less effective treatments, such as lime juice, proved effective enough to deal with the condition most of the time. Unfortunately, this meant that knowledge of the most effective treatment for scurvy was gradually lost....
