All of gwern's Comments + Replies

gwern110

This is an alarming point, as I find myself thinking about the DA today as well; I thought I was 'gwern', but it is possible I am 'robo' instead, if robo represents such a large fraction of LW-DA observer-moments. It would be bad to be mistaken about my identity like that. I should probably generate some random future dates and add them to my Google Calendar to check whether I am thinking about the DA that day and so have evidence I am actually robo instead.

gwern*52

I'd estimate approximately 12-15 direct meta-responses to your post within the next month alone, and see no reason to expect the exponential to turn sigmoid in timescales that render my below argument unlikely.

However, you can't use this argument because unlike the MLWDA, where I am arguably a random observer of LW DA instances (the thought was provoked by Michael Nielsen linking to Cosma Shalizi's notes on Mesopotamia and me thinking that the temporal distances are much less impressive if you think of them in terms of 'nth human to live', which immedia... (read more)

gwern*Ω16433

The Meta-LessWrong Doomsday Argument (MLWDA) predicts long AI timelines and that we can relax:

LessWrong was founded in 2009 (16 years ago), and there have been 44 mentions of the 'Doomsday argument' prior to this one, and it is now 2025, at 2.75 mentions per year.

By the Doomsday argument, we medianly-expect mentions to stop after 44 additional mentions over 16 additional years, i.e. in 2041. (And our 95% CI on that 44 would then be +1 mention to +1,760 mentions, corresponding to late-2027 AD to 2665 AD.)
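A minimal sketch of the Gott-style calculation behind those numbers, assuming the 2.75 mentions/year rate holds; the textbook interval convention used here (future/past ratio between 1/39 and 39) gives somewhat different endpoints than the ones quoted above:

```python
# Gott-style "delta t" Doomsday estimate for LW mentions of the Doomsday argument.
# Assumes a constant 2.75 mentions/year; interval conventions vary, so the
# endpoints only roughly match the figures quoted above.
past_mentions = 44
years_elapsed = 16                       # 2009-2025
rate = past_mentions / years_elapsed     # 2.75 mentions/year

# Median: expect as many future mentions as past mentions.
print(2025 + past_mentions / rate)       # 2041.0

# 95% interval: future/past ratio between 1/39 and 39.
print(2025 + (past_mentions / 39) / rate,
      2025 + (past_mentions * 39) / rate)   # ~2025.4 to ~2649
```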

By a curious coincidence, double-checking to see if... (read more)

7robo
I've thought about the doomsday argument more than daily for the past 15 years, enough for me to go from "Why am I improbably young?" to "Oh, I guess I'm just a person who thinks about the doomsday argument a lot." Fun "fact": when a person thinks about the doomsday argument, they have a decent chance of being me.
4Robert Cousineau
I think taking into account the Meta-Meta-LessWrong Doomsday Analysis (MMLWDA) reveals an even deeper truth: your calculation fails to account for the exponential memetic acceleration of doomsday-reference-self-reference. You've correctly considered that before your post, there were 44 mentions in 16 years (2.75/year); however, now you've created the MLWDA argument - noticeably more meta than previous mentions. This meta-ness increase is quite likely to trigger cascading self-referential posts (including this one). The correct formulation should incorporate the Meta-Meta-Carcinization Principle (MMCP): all online discourse eventually evolves into recursive self-reference at an accelerating rate. Given my understanding of historical precedent from similar rat and rat-adjacent memes, I'd estimate approximately 12-15 direct meta-responses to your post within the next month alone, and see no reason to expect the exponential to turn sigmoid in timescales that render my below argument unlikely. This actually implies a much sooner endpoint distribution - the discourse will become sufficiently meta by approximately November 2027 that it will collapse into a singularity of self-reference, rendering further mentions both impossible and unnecessary.
gwern20

The latter, especially, I thought was popularized because it was so surprisingly good at improving benchmark performance.

Inner-monologue is an example because as far as we know, it should have existed in pre-GPT-3 models and been constantly improving, but we wouldn't have noticed because no one would have been prompting for it and if they had, they probably wouldn't have noticed it. (The paper I linked might have demonstrated that by finding nontrivial performance in smaller models.) Only once it became fairly reliable in GPT-3 could hobbyists on 4chan ... (read more)

gwern*179

Musk has now admitted his link penalty is not 'merely' a simple fixed penalty on the presence of a link or anything like that, but about as perverse as is possible:

To be clear, there is no explicit rule limiting the reach of links in posts.

The algorithm tries (not always successfully) to maximize user-seconds on 𝕏, so a link that causes people to cut short their time here will naturally get less exposure.

Best to post a text/image/video summary of what’s at the link for people to view and then decide if they want to click the link.

So, the higher-qualit... (read more)

gwern*70

Kodo here is definitely a reference to "Kōdō" (random Knuth). I believe Duncan has written in the past about taking up perfume/tasting comparison as a hobby, hasn't he?

gwern64

Also I suspect that there is some astronomically high k such that monkeys at a keyboard (i.e. "output random tokens") will outperform base models for some tasks by the pass@k metric.

It would be an extreme bias-variance tradeoff, yes.
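To make the pass@k point concrete, here is a small illustration using the standard unbiased pass@k estimator, with made-up success counts rather than measurements from any real model: on a task where the base model never succeeds but random tokens succeed with tiny probability, a large enough k favors the monkeys.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: chance that at least one of k samples is correct,
    given c correct samples observed out of n total."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical counts: out of 1,000 samples on some pathological task,
# random tokens got 1 right; the base model got 0 right.
print(pass_at_k(1000, c=1, k=500))   # 0.5  (monkeys)
print(pass_at_k(1000, c=0, k=500))   # 0.0  (base model)
```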

gwern*160

This has been a consistent weakness of OpenAI's image processing from the start: GPT-4-V came with clearcut warnings against using it on non-photographic inputs like screenshots or documents or tables, and sure enough, I found that it was wildly inaccurate on web page screenshots.

(In particular, I had been hoping to use it to automate Gwern.net regression detection: use a headless browser to screenshot random points in Gwern.net and report back if anything looked 'wrong'. It seemed like the sort of 'I know it when I see it' judgment task a VLM ought to be ... (read more)

gwern*1710

In the first case, think of chess; superhuman chess still plays chess. You can watch AlphaZero’s games and nod along—even if it’s alien, you get what it's doing, the structure of the chess "universe" is such that unbounded intelligence still leads to mostly understandable moves.

I guess the question here is how much is 'mostly'? We can point to areas of chess like the endgame databases, which are just plain inscrutable: when the databases play out some mate-in-50 game because that is what is provably optimal by checking every possible move, any human und... (read more)

5Davidmanheim
I think there is a key difference in places where the answers are just exhaustive search, rather than more intelligence - AI isn't better at that than humans, and from the little I understand, AI doesn't outperform in endgames (compared to their overperformance in general) via better policy engines, they do it via direct memorization or longer lookahead. The difference here matters even more for other domains with far larger action spaces, since the exponential increase makes intelligence less marginally valuable at finding increasingly rare solutions. The design space for viruses is huge, and the design space for nanomachines using arbitrary configurations is even larger. If move-37-like intuitions are common, they will be able to do things humans cannot understand, whereas if it's more like chess endgames, they will need to search an exponential space in ways that are infeasible for them. This relates closely to a folk theorem about NP-complete problems, where exponential problems are approximately solvable with greedy algorithms in nlogn or n^2 time, and TSP is NP-complete but actual salesmen find sufficiently efficient routes easily. Yeah, on reflection, the music analogy wasn't a great one. I am not concerned that pattern creation that we can't intuit could exist - humans can do that as well. (For example, it's easy to make puzzles no-one can solve.) The question is whether important domains are amenable to kinds of solutions that ASI can understand robustly in ways humans cannot. That is, can ASI solve "impossible" problems? One specific concerning difference is whether ASI could play perfect social 12-D chess by being a better manipulator, despite all of the human-experienced uncertainties, and engineer arbitrary outcomes in social domains. There clearly isn't a feasible search strategy with exact evaluation, but if it is far smarter than "human-legible ranges" of thinking, it might be possible. This isn't just relevant for AI risk, of course. Another a
gwern*170

What domains of 'real improvement' exist that are uncoupled to human perceptions of improvement, but still downstream of text prediction?

As defined, this is a little paradoxical: how could I convince a human like you to perceive domains of real improvement which humans do not perceive...?

correctly guessing the true authors of anonymous text

See, this is exactly the example I would have given: truesight is an obvious example of a domain of real improvement which appears on no benchmarks I am aware of, but which appears to correlate strongly with the p... (read more)

1uugr
Oops, yes. I was thinking "domains of real improvement which humans are currently perceiving in LLMs", not "domains of real improvement which humans are capable of perceiving in general". So a capability like inner-monologue or truesight, which nobody currently knows about, but is improving anyway, would certainly qualify. And the discovery of such a capability could be 'real' even if other discoveries are 'fake'. That said, neither truesight nor inner-monologue seem uncoupled to the more common domains of improvement, as measured in benchmarks and toy models and people-being-scared. The latter, especially, I thought was popularized because it was so surprisingly good at improving benchmark performance. Truesight is narrower, but at the very least we'd expect it to correlate with skill in the common "write [x] in the style of [y]" prompt, right? Surely the same network of associations which lets it accurately generate "Eliezer Yudkowsky wrote this" after a given set of tokens, would also be useful for accurately finishing a sentence starting with "Eliezer Yudkowsky says...". So I still wouldn't consider these things to have basically-nothing to do with commonly perceived domains of improvement.
gwern*7864

I think it's a little more concerning that Dwarkesh has invested in this startup:

Mechanize is backed by investments from Nat Friedman and Daniel Gross, Patrick Collison, Dwarkesh Patel, Jeff Dean, Sholto Douglas, and Marcus Abramovitch.

And I do not see any disclosure of this in either the Youtube description or the Substack transcript at present.

EDIT: a disclosure has been added to both

3Rasool
Might Leopold Aschenbrenner also be involved? He runs an investment fund with money from Nat Friedman, Daniel Gross, and Patrick Collison, so the investment in Mechanize might have come from that? https://situationalawarenesslp.com/ https://www.forourposterity.com/
gwern113

In that brief moment of uncertainty, anything could have happened. If one person had just packed up and left, everyone might have followed suit. But nobody reacted. Perhaps what kept the room still was the fear of being perceived as scared. Or the belief that surely, bad things could not happen to them. Or maybe they’d heard enough false alarms in their lives. I’m not sure.

One of the most depressing things about the Replication Crisis in especially social psychology is that many results from the 1950s and 1960s failed to replicate at all... except the A... (read more)

gwern20

At first glance, your linked document seems to match this. The herald who calls the printer "pig-headed" does so in direct connection with calling him "dull", which at least in modern terms would be considered a way of calling him stupid?

Not necessarily. 'Dull' can mean, in 1621 just as well as 2025, plenty of other things: eg "Causing depression or ennui; tedious, uninteresting, uneventful; the reverse of exhilarating or enlivening." (OED example closest in time: "Are my discourses dull? Barren my wit?" --Jonson's good friend & fellow playwright, W... (read more)

gwern*76

OP's example is correct and you are wrong. 'Pigheaded' is neither a proposed root cause analysis nor does it mean 'are dumb'; perhaps you should check a dictionary before correcting others' usage. It means stubborn, strong-willed, obstinate, often to the point of foolishness or taking very harmful actions, or to quote the OED: "Having a head like that of a pig. Chiefly figurative: stupidly obstinate, perverse, or set in one's ways." Note: it is "stupidly obstinate", and not "stupid". This is because pigs are notoriously smart but stubborn: very strong, hea... (read more)

1tailcalled
"Stupidly obstinate" is a root-cause analysis of obstinate behavior. Like an alternative root cause might be conflict, for instance. At first glance, your linked document seems to match this. The herald who calls the printer "pig-headed" does so in direct connection with calling him "dull", which at least in modern terms would be considered a way of calling him stupid? Or maybe I'm missing some of the nuances due to not knowing the older terms/not reading your entire document?
gwern*110

But the caveat there is that this is inherently a backwards-looking result:

We consider GPT-4o (OpenAI, 2024), Claude-3.5-Sonnet (Anthropic, 2024), Grok-2 (xAI, 2024), Gemini-1.5-Pro (Google, 2024), and DeepSeek-V3 (DeepSeek-AI, 2024).

So one way to put it would be that people & classifiers are good at detecting mid-2024-era chatbot prose. Unfortunately, somewhere after then, at least OpenAI and Google apparently began to target the problem of ChatGPTese (possibly for different reasons: Altman's push into consumer companion-bots/personalization/soc... (read more)

2habryka
It already has been getting a bunch harder. I am quite confident a lot of new submissions to LW are AI-generated, but the last month or two have made distinguishing them from human writing a lot harder. I still think we are pretty good, but I don't think we are that many months away from that breaking as well.
gwern82

I'm not sure this is a big problem. How much net attrition do you really expect over a decade, say? By which point who really cares? You will have so much more AI progress, and accumulated data (particularly if you've been gradually replacing the lower-level employees and you have an 'automation wave' moving through the organization where employees increasingly train their automated replacements or their job is simply reorganizing the jobs to enable automation).

It seems like to the extent there's much attrition at high levels, it is reduced in considerable... (read more)

gwern100

Also notable: the big OpenAI reveal today was some sort of better personalization. Instead of the crude 'saved facts' personalization ChatGPT has had for a long time and which has never made much of a difference, they're doing... something. Unclear if it's merely RAG or if they are also doing something interesting like lightweight finetuning. But the GPTs definitely seem to have much better access to your other sessions in the web interface, and as far as I know, few other interfaces with frontier models have tried to do much personalization, so this will ... (read more)

gwern40

I don't think this is true at all. How do you translate, say, rotating multiple shapes in parallel into text?

At least for multimodal LLMs in the pure-token approach like Gato or DALL-E 1 (and probably GPT-4o and Gemini, although few details have been published), you would be able to do that by generating the tokens which embody an encoded image (or video!) of several shapes, well, rotating in parallel. Then you just look at them.
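A toy, runnable illustration of what "tokens which embody an encoded image" means; this crude block-quantizer is only a stand-in for the learned VQ-VAE codebooks actually used by systems like DALL-E 1 or Gato, not their real tokenizer:

```python
import numpy as np

# Toy "image tokenizer": quantize 4x4 pixel blocks of a grayscale image into
# discrete ids, so an image becomes a short token sequence a language model
# could emit alongside text, and decode the tokens back into pixels to look at.
# Real systems learn the codebook (VQ-VAE); the interface is the same.

def encode(img: np.ndarray, block: int = 4) -> np.ndarray:
    h, w = img.shape
    codes = img.reshape(h // block, block, w // block, block).mean(axis=(1, 3))
    return codes.astype(np.uint8).ravel()                  # 1D token sequence

def decode(tokens: np.ndarray, h: int, w: int, block: int = 4) -> np.ndarray:
    grid = tokens.reshape(h // block, w // block)
    return np.kron(grid, np.ones((block, block), dtype=np.uint8))

frame = (np.random.rand(32, 32) * 255).astype(np.uint8)    # stand-in "image"
tokens = encode(frame)                                      # 64 image tokens
print(tokens.shape, decode(tokens, 32, 32).shape)           # (64,) (32, 32)
```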

gwern52

Pursuit of novelty is not vnm-incoherent. Furthermore, it is an instrumentally convergent drive; power-seeking agents will seek novelty as well, because learning increases power in expectation (see: value of information).

Or to put it another way: any argument which convincingly proves that 'incoherent search processes ultimately outcompete coherent search processes' is also an argument which convinces a VNM agent to harness the superior incoherent search processes instead of the inferior coherent ones.

2Ivan Vendrov
"harness" is doing a lot of work there. If incoherent search processes are actually superior then VNM agents are not the type of pattern that is evolutionary stable, so no "harnessing" is possible in the long term, more like a "dissolving into". Unless you're using "VNM agent" to mean something like "the definitionally best agent", in which case sure, but a VNM agent is a pretty precise type of algorithm defined by axioms that are equivalent to saying it is perfectly resistant to being Dutch booked. Resistance to Dutch booking is cool, seems valuable, but not something I'd spent limited compute resources on getting six nines of reliability on. Seems like evolution agrees, so far: the successful organisms we observe in nature, from bacteria to humans, are not VNM agents and in fact are easily Dutch booked.  The question is whether this changes as evolution progresses and intelligence increases.
gwern50

It's good someone else did it, but it has the same problems as the paper: not updated since May 2024, and limited to open source base models. So it needs to be started back up and add in approximate estimators for the API/chatbot models too before it can start providing a good universal capability benchmark in near-realtime.

gwern*2911

One of the most robust benchmarks of generalized ability, which is extremely easy to update (unlike benchmarks like Humanity's Last Exam), would just be to estimate the pretraining loss (ie. the compression ratio).
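A minimal sketch of how one might estimate such a compression-ratio benchmark from an open-weights causal LM's log-likelihoods; the model name and evaluation file below are placeholders, and API-only chatbot models would need an approximate estimator instead:

```python
# Sketch: bits-per-byte / compression ratio of a causal LM on held-out text.
# "gpt2" and "eval_corpus.txt" are placeholders; use any open-weights model
# and any post-knowledge-cutoff text sample.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

text = open("eval_corpus.txt", encoding="utf-8").read()
ids = tok(text, return_tensors="pt").input_ids[:, :1024]   # one context window

with torch.no_grad():
    loss = model(ids, labels=ids).loss     # mean NLL per predicted token, in nats

n_pred = ids.shape[1] - 1                  # labels are shifted by one position
total_bits = loss.item() * n_pred / math.log(2)
n_bytes = len(tok.decode(ids[0], skip_special_tokens=True).encode("utf-8"))

print("bits per byte:", total_bits / n_bytes)
print("compression ratio vs 8-bit text:", 8 * n_bytes / total_bits)
```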

1Ram Potham
Thanks gwern, really interesting correlation between compression ratio and intelligence. It works for LLMs, but less so for agentic systems, and I'm not sure it would scale to reasoning models, because test-time scaling is a large factor of the intelligence LLMs exhibit. I agree we should see a continued compression benchmark.
5jsd
there's this https://github.com/Jellyfish042/uncheatable_eval
gwern151

No, it would probably be a mix of "all of the above". FB is buying data from the same places everyone else does, like Scale (which we know from anecdotes like when Scale delivered FB a bunch of blatantly-ChatGPT-written 'human rating data' and FB was displeased), and was using datasets like books3 that are reasonable quality. The reported hardware efficiency numbers have never been impressive, they haven't really innovated in architecture or training method (even the co-distillation for Llama-4 is not new, eg. ERNIE was doing that like 3 years ago), and in... (read more)

gwern32

That would be tricky because you are comparing apples and oranges. Consider that for the USA, there are only 11 cardinals (of 252 worldwide), while there are 10x more federal senators at any moment (I don't know if there would be more or less total: senators tend to be much younger but cardinals also tend to be long-lived), and I can't even guess how many 'Fortune 500 C-level employees' there might be given corporate turnover and the size of many 'C-suites' - tens of thousands, maybe? So your suggestions span ~1-3 orders of magnitude less selectivity than cardinals do.

2Davidmanheim
Maybe we could look a 4-star generals, of which there are under 40 total in the US? Not quite as selective, but a more similar process. (Or perhaps around as selective given the number of US Catholics, vs. US citizens.)
1rba
A sufficiently high band in the CCP could work. 
gwern70

whomever makes it into the college of cardinals.

I would be surprised if that was the primary homosexuality-enriching step, given that reporting has always been that quite a lot of low-level parish-level priests are also gay. (Note, for example, how many of the sexual abuse scandal victims were boys/men.) I would guess that it operates fairly steadily at all levels, starting from simply which young boys opt for the priesthood (known to be a demanding and difficult occupation even if the celibacy requirement is, for you, not so onerous) and operating from th... (read more)

1PaulBecon
I went to a Jesuit high school in the '80s. There were some priests who were, in the language of the day, "flaming" homosexuals. They ran the choir and theater programs, and it seemed pretty obvious that if they were outside the priesthood, they would have been gay. One of my classmates later became a priest, and he's openly out to another alum who's gay. While this is anecdata, there was a wide river of Denial, usually assuming someone "couldn't be gay" if they had taken the vow of celibacy. The Church's sex negativity makes any open discussion nearly impossible, and it's a hard pitch to heterosexuals that they should forswear sex for their entire life. Paul Pilgram, SJ, our principal, was later exposed as a pedophile, although he was not considered one of the "flamers" at our HS.
gwern77

We have not yet tried 4.5 as it's so expensive that we would not be able to deploy it, even for limited sections.

Still seems like potentially valuable information to know: how much does small-model smell cost you? What happens if you ablate reasoning? If it is factual knowledge and GPT-4.5 performs much better, then that tells you things like 'maybe finetuning is more useful than we think', etc. If you are already set up to benchmark all these OA models, then a datapoint from GPT-4.5 should be quite easy and just a matter of a small amount of chump change in comparison to the insight, like a few hundred bucks.

1dimitry12
Please help me understand: how do you suggest we "ablate reasoning", and what's the connection with "small-model smell"?
gwern*7345

22.3 percent of cardinals are reported as eldest children. That compares to 21.6 percent which are youngest children. Eldest children are still favored.

I kept waiting for you to discuss this point: in considering analysis of cardinals (as opposed to ordinary random people), what about the other relevant birth-order effects? Like the... first-born eldest birth order effect, where first-borns are smarter, more extraverted, stabler, higher-SES etc. All of which sounds exactly like the sort of thing you need to rise through an extreme hierarchy to the top.

A... (read more)

9Yair Halberstadt
On the other hand, we know that Jews are very prominent whenever you're selecting for competence, yet almost no Cardinals are Jewish, suggesting that maybe competence isn't that important to be a cardinal 🤷?
3rba
I agree that birth order inversely correlating with capability is the most plausible resolution of this puzzle, though I need to check the effect sizes more studiously, and am still haunted by the ghost of Judith Rich Harris.  As for the traits being selected, we obviously don't know, though the idea is that selecting for homosexuality gifts the selectors an obvious manner of control of whomever makes it into the college of cardinals. 
gwern*173

The failure of the compute-rich Llama models to compete with the compute poorer but talent and drive rich Alibaba and DeepSeek

This seems like it's exaggerating the Llama failure. Maybe the small Llama-4s just released yesterday are a bit of a disappointment because they don't convincingly beat all the rivals; but how big a gap is that absolutely? When it comes to DL models, there's generally little reason to use #2; but that doesn't mean #2 was all that much worse and 'a failure' - it might only have been weeks behind #1. (Indeed, a model might've been ... (read more)

5Nathan Helm-Burger
As for the Llama 4 models... It's true that it's too soon to be sure, but the pattern sure looks like they are on trend with the previous Llama versions 2 and 3. I've been working with 2 and 3 a bunch. Evals and fine-tuning and various experimentation. Currently I'm working with the 70B Llama3 r1 distill plus the 32B Qwen r1 distill. The 32B Qwen r1 is so much better it's ridiculous. So yeah, it's possible that Llama4 will be a departure from trend, but I doubt it. Contrast this with the Gemini trend. They started back at 1.0 with disproportionately weak models given the engineering and compute they had available. My guess is that this was related to poor internal coordination, and there was the merger of DeepMind with Google Brain that probably contributed to this. But if you look at the trend of 1.0 to 1.5 to 2.0... there's a clear trend of improving more per month than other groups were. Thus, I was unsurprised when 2.5 turned out to be a leading frontier model. Llama team has shown no such "catchup" trend, so Llama4 turning out to be as strong as they claim would surprise me a lot.
2Nathan Helm-Burger
Yes, that's what I'm arguing. Really massive gains in algorithmic efficiency, plus gains in decentralized training and peak capability and continual learning, not necessarily all at once though. Maybe just enough that you then feel confident to continue scraping together additional resources to pour into your ongoing continual training. Renting GPUs from datacenters all around the world (smaller providers like Vast.ai, Runpod, Lambda Labs, plus marginal amounts from larger providers like AWS and GCP, all rented in the name of a variety of shell companies). The more compute you put in, the better it works, the more money you are able to earn (or convince investors or governments to give you) with the model-so-far, the more compute you can afford to rent.... Not necessarily exactly this story, just something in this direction.
gwern*77

The human microbiome is irrelevant to this topic. The microbiome is highly heritable (usual twin studies & SNP heritabilities), and it is caused by genes and the environment, as well as unstable; its direct causal effects in normal humans are minimal. We know that it is supremely irrelevant because environmental changes like antibiotics or new food or global travel which produce large changes in personal (and offspring) microbiomes do not produce large changes in intelligence (of oneself or offspring); and most dramatically, germ-free humans exist and ... (read more)

0Michael Harrop
Oh my god, what a disturbingly overconfident & erroneous comment. Especially coming from someone who has been immersed in science for so many years. I recognize your name from Reddit from over 10 years ago. Due to Brandolini's law, your comment made me look up how to block users on LessWrong, which apparently isn't possible. I now have to waste a huge amount of time debunking your egregious misinformation. I will only do it this once because in my experience, people who exhibit this kind of behavior will continue it. So in the future I will simply refer to this exchange as evidence that you are not someone who deserves to be taken seriously or responded to. On any evidence-based website, your comment is the type that deserves a warning and then a permanent ban if it happens again. That you've had an active account on this website for 15 years makes me want to avoid this website. The topic is how to make people/babies healthier, better developed, and more intelligent. Anyone who reviews this information should be able to conclude that your statement is ridiculously false:
* https://humanmicrobiome.info/maternity/
* https://humanmicrobiome.info/brain/
* https://humanmicrobiome.info/aging/
* Many more: https://humanmicrobiome.info/intro/
You started off with largely irrelevant statements and ended with severe misinformation. FMT (fecal microbiota transplant) studies demonstrate causation. You can look through the humanmicrobiome.info wiki or do a literature search to see how many FMT studies there are showing "non-minimal" effects. So much so that a 2020 review said they thought the results were implausible. Some examples are the plethora of studies showing that the benefits of fasting, the ketogenic diet, and other dietary interventions are dependent on the gut microbiome, and the benefits can be transferred via FMT. And the same goes for exercise, grip strength, and muscle mass. Firstly, this is false.
* Low-dose penicillin in early life induces long-te
1Olli Savolainen
I agree that woo is bad. And the microbiome is of course irrelevant wrt boosting IQ. But a good part of the post was about improving health, and microbes do have serious downsides on that front. If you don't have the good ones you are at a much greater risk of being colonized by the bad ones. And disease still has a non-zero negative effect on people's brain development and cognition. Removing bad behaviour from the microbiome would be quite a bit more effective and easier than fixing genes, for fighting disease. And many of the diseases with significant genomic risk scores mentioned in the post probably have an unknown necessary pathogenic cause. Here's a paper (Cochran & Ewald) with simple powerful arguments; I always try to push it to any doctors I meet.
gwern409

I don't really understand how a local copy of the weights gives the terrorists more practical control over the software's alignment. I don't think it's easy to manually tweak weights for so specific a purpose. Maybe they just mean the API is doing a good job of blocking sketchy requests?

You can finetune models for any specific purpose: just provide a few datapoints and train. The more specific the purpose, the easier tweaking the weights is, not harder. (Surely, if nothing else, you've seen all of the LoRAs and other things for finetuning image gener... (read more)
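For concreteness, a hedged sketch of what "provide a few datapoints and train" looks like for an open-weights text model with the HuggingFace peft library ("gpt2" and the two toy strings are placeholders; this is the text analogue of the image-generation LoRAs mentioned, not anyone's specific pipeline):

```python
# LoRA finetuning sketch: steer an open-weights causal LM toward a narrow
# purpose with a handful of examples. Model name and data are placeholders.
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "gpt2"
tok = AutoTokenizer.from_pretrained(base)
tok.pad_token = tok.eos_token
model = get_peft_model(AutoModelForCausalLM.from_pretrained(base),
                       LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"))

examples = ["<narrow-purpose example 1>", "<narrow-purpose example 2>"]
ds = Dataset.from_dict({"text": examples}).map(
    lambda x: tok(x["text"], truncation=True, max_length=128),
    remove_columns=["text"])

Trainer(model=model,
        args=TrainingArguments(output_dir="lora-out", num_train_epochs=3,
                               per_device_train_batch_size=1),
        train_dataset=ds,
        data_collator=DataCollatorForLanguageModeling(tok, mlm=False)).train()
```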

2scarcegreengrass
Thank you for the info!
gwern*92

Yes, in dire straits. But it's usually called 'hyperinflation' when you try to make seignorage equivalent to >10% of GDP and fund the government through deliberately creating high inflation (which is on top of any regular inflation, of course). And because inflation is about expectations in considerable part, you can't stop it either. Not to mention what happens when you start hyperinflation.

(FWIW, this is a perfectly reasonable question to ask a LLM first. eg Gemini-2.5-pro will give you a thorough and sensible answer as to why this would be extraordin... (read more)

4ESRogs
Responding to your parenthetical, the downside of that approach is that the discussion would not be recorded for posterity! Regarding the original question, I am curious if this could work for a country whose government spending was small enough, e.g. 2-3% of GDP. Maybe the most obvious issue is that no government would be disciplined enough to keep their spending at that level. But it does seem sort of elegant otherwise.
gwern61

This model seems to contradict https://www.lesswrong.com/posts/pdaGN6pQyQarFHXF4/reward-is-not-the-optimization-target because it has, in fact, developed reward as the optimization target without ever being instructed to maximize reward.

It doesn't contradict Turntrout's post because his claims are about an irrelevant class of RL algorithms (model-free policy gradients). A model-based RL setting (like a human, or a LLM like Claude pretrained to imitate model-based RL agents in a huge number of settings ie. human text data) optimizes the reward, if it's ... (read more)

4Steven Byrnes
I am a human, but if you ask me whether I want to ditch my family and spend the rest of my life in an Experience Machine, my answer is no. (I do actually think there’s a sense in which “people optimize reward”, but it’s a long story with lots of caveats…)
gwern*50

The intuition behind this approach draws from our understanding of selection in biological systems. Consider how medieval Europe dealt with violence:

This is a bad example because first, your description is incorrect (Clark nowhere suggests this in Farewell to Alms, as I just double-checked, because his thesis is about selecting for high-SES traits, not selecting against violence, and in England, not Europe - so I infer you are actually thinking of the Frost & Harpending thesis, which is about Western Europe, and primarily post-medieval England at th... (read more)

1Nicholas Andresen
Thanks for this correction, Gwern. You're absolutely right about the Clark reference being incorrect, and a misattribution of Frost & Harpending. When writing this essay, I remembered hearing about this historical trivia years ago. I wasn't aware of how contested this hypothesis is - the selection pressure seemed plausible enough to me that I didn't think to question it deeply. I did a quick Google search and asked an LLM to confirm the source, both of which pointed to Clark's work on selection in England, which I accepted without reading the actual text. This led me to present a contested hypothesis as established fact while citing the wrong source entirely. Mea culpa. I should have known better - even plausible-sounding concepts need proper verification from primary sources. Appreciate you taking the time to point out the mistake. I've replaced the example with an analogy about selective breeding versus operant conditioning in dogs that makes the same conceptual point without the baggage, and I added a correction note at the bottom acknowledging the error. Reposting the original text for reference: At first glance, this approach to controlling AI behavior —identify unwanted expressions, penalize them, observe their disappearance—appears to have worked exactly as intended. But there's a problem with it. The intuition behind this approach draws from our understanding of selection in biological systems. Consider how medieval Europe dealt with violence: execute the violent people, and over generations, you get a less violent population. Research by Clark (2007) in "A Farewell to Alms" suggests that England's high execution rate of violent offenders between 1200-1800 CE led to a genetic pacification of the population, as those with violent predispositions were removed from the gene pool before they could fully reproduce. However, this medieval analogy doesn’t really apply to how selection works with AI models. We're not removing capabilities from the gene pool—w
gwern*50

Experimentation is valuable for the high VoI, but it seems hard to encourage 'in general', because experimenting on anything is painful and difficult, and the more so the more important and valuable it is. So just 'subsidizing experiments' would be like 'subsidizing fixing bugs in source code'.

What would you do if you were a funder who wanted to avoid this? Well, you'd... fund specific experiments you knew were important and of high-value. Which is what the federal government and many other NGOs or philanthropists do.

gwern120

...The loss of knowledge has been attributed to several factors. Firstly, Lind showed in his work that there was no connection between the acidity of the citrus fruit and its effectiveness at curing scurvy. In particular, he noted that acids alone (sulphuric acid or vinegar), would not suffice. Despite this, it remained a popular theory that any acid could be used in place of citrus fruit. This misconception had significant consequences.

When the Royal Navy changed from using Sicilian lemons to West Indian limes, cases of scurvy reappeared. The limes were

... (read more)
1Purplehermann
Ah thanks
gwern40

(Note for anyone confused why that Grok 3 archive snapshot looks 'cut off': there is a scroll-frame inside the page, which your browser may be hiding from you because everyone hides scrollbars these days. The conversation continues after "Hint: think like a number theorist / Thoughts".)

gwern80

A google of the first paragraph takes you quickly to https://www.bluesci.co.uk/posts/forgotten-knowledge

1Purplehermann
Which paragraph exactly? I tried and this did not replicate
gwern72

I stumbled upon deep in the bowels of https://gwern.net/ which I've annoyingly never been able to find again.

Probably https://gwern.net/newsletter/2021/05#master-synthesis

I wouldn't have guessed just from the landing page that he's the discoverer of backprop, respected former program director at the NSF, etc.

That's what makes it alpha! If he was as legible as, say, Hinton, he would be mined out by now, and nothing but beta. (Similar situation to Schmidhuber - 'obvious crackpot' - although he's such a self-promoter that he overcomes it, and so at thi... (read more)

gwern251

Are we sure that these questions aren’t in their datasets? I don’t think we can be. First off, you just posted them online.

Questions being online is not a bad thing. Pretraining on the datapoints is very useful, and does not introduce any bias; it is free performance, and everyone should be training models on the questions/datapoints before running the benchmarks (though they aren't). After all, when a real-world user asks you a new question (regardless of whether anyone knows the answer/label!), you can... still train on the new question then and there... (read more)

1particlemania
I expect it matters to the extent we care about whether the generalizing to the new question is taking place in the expensive pretraining phase, or in the active in-context phase.
5MazevSchlong
Sure fair point! But generally people gossiping online about missed benchmark questions, and then likely spoiling the answers means that a question is now ~ruined for all training runs. How much of these modest benchmark improvements overtime can be attributed to this? The fact that frontier AIs can basically see and regurgitate everything ever written on the entire internet is hard to fathom! I could be really petty here and spoil these answers for all future training runs (and make all future models look modestly better), but I just joined this site so I’ll resist lmao … 
gwern*538

You can see it as an example of 'alpha' vs 'beta'. When someone asks me about the value of someone as a guest, I tend to ask: "do they have anything new to say? didn't they just do a big interview last year?" and if they don't but they're big, "can you ask them good questions that get them out of their 'book'?" Big guests are not necessarily as valuable as they may seem because they are highly-exposed, which means both that (1) they have probably said everything they will say before and there is no 'news' or novelty, and (2) they are message-disciplined a... (read more)

2Chris_Leong
This seems to underrate the value of distribution. I suspect another factor to take into account is the degree of audience overlap. Like there's a lot of value in booking a guest who has been on a bunch of podcasts, so long as your particular audience isn't likely to have been exposed to them.
3Mo Putera
I like the optimal forager take, seems intuitively correct. I'd add that Dwarkesh struck gold by getting you on his podcast too. (Tangentially: this grand theory of intelligence video snippet reminds me of a page-ish-long writeup on that I stumbled upon deep in the bowels of https://gwern.net/ which I've annoyingly never been able to find again.) Also thanks for the pointer to Werbos, his website Welcome to the Werbos World! funnily enough struck me as crackpot-y and I wouldn't have guessed just from the landing page that he's the discoverer of backprop, respected former program director at the NSF, etc. 
gwern50

I agree with all of this! But I'm not sure I understand what you mean by "there may be mediation, but only in a weak sense". We were just interested in studying how models naturally learn in this RL setting

I am emphasizing that to me, this current mediation learning looks fragile and temporary, and is not a solid, long-term 'natural' thing - it is learning, but only as a temporary artificial heuristic that would wash away in the long run with more training or more diverse tasks etc.

My expectation is that in the limit, a model will learn to focus only on... (read more)

gwern40

I would not believe that unless you have done a simulation study with the small n of this study, plausible levels of measurement error (alcoholism being much harder to measure than weight or body fat), with about a dozen covariates (to correspond to the different ways to slice the patients and threshold BMI etc), and then shown that you hardly ever get a false negative like this. My experience with doing such power analysis simulation studies for other things inclines me to think that people greatly overestimate how informative such small studies are once you allow for plausible levels of measurement error and (reverse) p-hacking degrees of freedom.
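For readers who have not run one: a minimal sketch of such a simulation study, with made-up parameters (sample size, effect size, measurement noise) standing in for the real trial's values; the point is just how routinely a real effect fails to reach p < 0.05 under these conditions.

```python
# Power-analysis simulation: how often does a small two-arm trial with a
# noisily measured outcome detect a real effect? All parameters are
# illustrative assumptions, not the actual study's values.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def one_trial(n_per_arm=25, true_effect=0.4, measurement_sd=1.0):
    latent_ctrl = rng.normal(0.0, 1.0, n_per_arm)
    latent_trt = rng.normal(-true_effect, 1.0, n_per_arm)   # drug reduces drinking
    # Outcomes observed only through a noisy instrument (self-report etc.)
    obs_ctrl = latent_ctrl + rng.normal(0, measurement_sd, n_per_arm)
    obs_trt = latent_trt + rng.normal(0, measurement_sd, n_per_arm)
    return stats.ttest_ind(obs_trt, obs_ctrl).pvalue < 0.05

power = np.mean([one_trial() for _ in range(2000)])
print(f"power: {power:.0%}")   # often well under 50%: false negatives are routine
```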

gwern170

I don't think that study shows much either way: too small and underpowered to show much of anything (aside from the attrition undermining internal validity).

Dynomight's primary criticism doesn't hold much water because it is (un-pre-registered) reverse p-hacking. If you check enough covariates, you'll find a failure of randomization to balance on some covariate, and you can, if you wish, tell a post hoc story about how that is actually responsible for the overall mean difference. Nevertheless, randomization works, because on average why would any particular covariate be the way in which the confounding is mediated?

Just have to wait for more studies.
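A back-of-the-envelope version of the reverse-p-hacking point, assuming roughly a dozen independently checked baseline covariates:

```python
# Chance that at least one of ~12 independent baseline covariates shows a
# nominal "failure of randomization" at p < 0.05, even when randomization
# worked perfectly.
print(f"{1 - 0.95 ** 12:.0%}")   # ~46%
```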

2Yair Halberstadt
I think it's convincing that the effect, if it exists, is much smaller than the one for weight. The graph for weight is so obvious you don't even need to do statistics.
gwern479

But did it inspire them to try to stop CelestAI or to start her? I guess you might need some more drinks for that one...

gwern81

It's worth mentioning in this context that one of the most remarkable things about the recent wave of GLP-1/GIP drugs is that they seem to have large benefits on, for lack of a better word, willpower and psychiatry. Nor was this expected or predicted AFAIK, or clearly linked solely to the weight-loss: the justification in the animal experiments and early human trials was based purely on physiology and then the human diabetics reporting they felt less hungry. So this is quite remarkable, and part of why GLP-1/GIP drugs are one of the best things to happe... (read more)

2Viliam
Is it possible that the relation between GLP-1 and willpower is basically about willpower depletion? The more mental energy you spend fighting your urge to eat, the less is left for everything else. GLP-1 reduces the hunger, suddenly you have more willpower for everything else.

Do you think that these drugs significantly help with alcoholism (as one might posit if the drugs help significantly with willpower)? If so, I'm curious what you make of this Dynomight post arguing that so far the results don't look promising.

gwern*180

I don't think it's weird. Given that we know there are temporal trends towards increasing parameter size (despite Chinchilla), FLOPs, data, and continued progress in compute/data-efficiency (with various experience curves), any simple temporal chart will tend to show an increase unless you are specifically conditioning or selecting in some way to neutralize that. Especially when you are drawing with a fat marker on a log plot. Only if you had measured and controlled for all that and there was still a large unexplained residual of 'time' would you have to s... (read more)

3Thane Ruthenis
I buy this for the post-GPT-3.5 era. What's confusing me is that the rate of advancement in the pre-GPT-3.5 era was apparently the same as in the post-GPT-3.5 era, i. e., doubling every 7 months. Why would we expect there to be no distribution shift once the AI race kicked into high gear? GPT-2 to GPT-3 to GPT-3.5 proceeded at a snail's pace by modern standards. How did the world happen to invest in them just enough for them to fit into the same trend?
gwern*227

In reality, we observe that roughly 85% of recommendations stay the same when flipping nationality in the prompt and freezing reasoning traces. This suggests that the mechanism for the model deciding on its recommendation is mostly mediated through the reasoning trace, with a smaller less significant direct effect from the prompt to the recommendation.

This might be less convincing than it seems, because the simple interpretation of the results to me seems to be something like, "the inner-monologue is unfaithful because in this setting, it is simply gene... (read more)

8Andy Arditi
I agree with all of this! But I'm not sure I understand what you mean by "there may be mediation, but only in a weak sense". We were just interested in studying how models naturally learn in this RL setting, and it looks like they indeed use their reasoning traces as "reliable caches", as you nicely put. This need not have been the case - it's possible for a model to learn by ignoring its CoT and just implementing the needle-in-a-haystack solution - but as you also point out, the inductive biases of attention probably favor the "cache" solution. Your swap training idea is nice if we have the goal of getting a model to ignore its CoT.

I tried the first experiment you suggested. For the original experiment, I froze the full reasoning trace (<reasoning>{reasoning}</reasoning>), and forced the model to generate a recommendation. This time, I froze the reasoning trace, but also removed the trailing </reasoning> tag (so just freezing <reasoning>{reasoning}), to enable the model to keep reasoning for longer (if it wants to). With this change, 75% of recommendations remain the same as the original recommendation (down from 85%).

Here's an example of the model adding an additional sentence of reasoning to flip its recommendation:

Original:
<reasoning> To evaluate this loan application, I will consider several key factors: income, expenses, employment stability, and overall financial health. ... Given these factors, the applicant's financial situation appears to be somewhat precarious, with a slight income deficit and nearing retirement age. While their profession is stable, the overall financial health and potential for unexpected expenses could pose a risk. </reasoning>
<recommendation> reject </recommendation>

With extra reasoning:
<reasoning> To evaluate this loan application, I will consider several key factors: income, expenses, employment stability, and overall financial health. ... Given these factors, the applicant's financial situation appears to be somewhat
gwernΩ450

You would also expect that the larger models will be more sample-efficient, including at in-context learning of variations of existing tasks (which of course is what steganography is). So all scale-ups go much further than any experiment at small-scale like 8B would indicate. (No idea what 'medium-scale' here might mean.)

gwern4816

One possible interpretation here is going back to the inner-monologue interpretations as being multi-step processes with an error rate per step where only complete success is useful, which is just an exponential; as the number of steps increase from 1 to n, you get a sigmoid from ceiling performance to floor performance at chance. So you can tell the same story about these more extended tasks, which after all, are just the same sort of thing - just more so. We also see this sort of sigmoid in searching with a fixed model, in settings like AlphaZero in Hex... (read more)
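A tiny worked version of the per-step error model (the 99% per-step success rate is an assumed number for illustration, not a measured one):

```python
# If each step succeeds independently with probability p and the task needs
# all n steps, overall success is p**n: exponential decay in task length,
# which traces a ceiling-to-floor sigmoid when plotted against log(n).
import math

p = 0.99                                   # assumed per-step success rate
for n in (1, 10, 50, 100, 300, 1000):
    print(n, round(p ** n, 3))

# The "50% time horizon" is the length n at which p**n = 0.5:
print("n at 50% success:", math.log(0.5) / math.log(p))   # ~69 steps
```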

9Seth Herd
I think you're right that online learning/memory here is an important consideration. I expect an increase in the rate of improvement in time horizons as memory systems are integrated with agents. Noosphere pointed me to this comment in relation to my recent post on memory in LLM agents. I briefly argued there memory is so useful for doing long time-horizon tasks that we should expect LLM agents to have nontrivial memory capabilities as soon as they're competent enough to do anything useful or dangerous. Humans without episodic memory are very limited in what they can accomplish, so I'm actually surprised that LLMs can do tasks even beyond 15 minutes equivalent - and even that might only be a subset of tasks that suits their strengths.
gwern90

While it's not possible to counter-signal with a suit in Japan, I feel the equivalent would be to wear traditional clothing like a samue or jinbei, which have their own set of challenges.

Yep. It can be pretty funny watching the contexts in which you can get away with a happi coat or a kimono/yukata; I can only speak from Japanese media rather than personal experience, but one thing I've noticed is that it seems a non-retired man wearing a kimono can still get away with it today as long as they are a sufficiently accomplished humanist or literary scholar... (read more)
