Comments

gwern110

Using LLMs is an intellectual skill. I would be astonished if IQ was not pretty helpful for that.

I don't think it is all that helpful, adjusting for the tasks that people do, after years of watching people use LLMs. Smart people are often too arrogant and proud, and know too much. "It's just a pile of matrix multiplications and a very complicated if-function and therefore can't do anything" is the sort of thing only a smart person can convince themselves of, whereas a dumb person thinking "I ask the smart little man in the magic box my questions and I get answers" is getting more out of it. (The benefits of LLM usage are also highly context-dependent: you'll find studies showing LLMs assist the highest performers most, but also ones showing they help the lowest performers most.) In 2020, the more you knew about AI, the dumber your uses of GPT-3 were, because you 'knew' that it couldn't do anything, so you had to hold its hand through everything, phrase everything in baby talk, etc. You had to unlearn everything you knew and anthropomorphize it to meaningfully explore prompting. This requires a certain flexibility of mind that has less to do with IQ and more to do with, say, schizophrenia - the people in Cyborgism, who do the most interesting things with LLMs, are not extraordinarily intelligent. They are, however, kinda weird and crazy.

gwern85

I'm not sure we want something beyond the statistical range of human personality traits

Obviously it is untrue that editing is useless if it 'only' gives you a von Neumann. Similarly for personality. We don't reify sets of personality traits as much as IQ, which is more obvious, but there are definitely many people who achieved remarkable things through force of personality. (Figures like Lee Kuan Yew, Napoleon, or Elon Musk come to mind: they were smart, and lucky, and made good choices, but there is clearly still a lot left over to explain.) And because personality is many things and there seems to be a pipeline model of output, you quickly get very few people at the tails who assemble all the right components. (Gignac has a paper making this point more explicitly.)

Why not select outliers from the population using personality testing and give them high intelligence?

You're acting like it's uncontroversially true that you have unlimited edits and can change any property at any time in development. I don't think that is the case.* There is going to be an editing budget and limits to editing. One might as well ask the opposite question: why not select intelligence outliers from the population and give them high personality traits? (Well, to know you don't want to do that, you would have to have some idea of how well personality editing would work - which we don't. That's my point!)

* Actually, the whole adult thing is a bit of a red herring. I believe even OP has largely abandoned the idea of adult editing and gone back to embryo-based approaches...? This is just a convenient place to drop my comment about uses of editing which will matter more over the next 30 years.

gwern116

I think you would probably be downvoted because you have already admitted to writing poorly-thought-out, ignorant comments under conditions conducive to arrogance and bad judgment, which you are apparently unashamed of and feel no need to rectify (eg. by refraining from commenting until you are recovered), while dragging in unrelated claims which are seriously problematic - like uncritical belief in Dunning-Kruger as a thing, or claiming that anyone is touting 'IQ over WAIS' (WAIS... as in, the IQ test WAIS?), or apparently believing in things like multiple intelligences - and your comments are littered with mockery, spelling errors, and grandiose generalizations writing checks that you don't remotely come close to cashing. (Saying you've definitely seen data, trust me bro, but you can't remember where, and everyone should just go google it themselves, is not a convincing argument.)

If you are going to comment on my serious writings - and in my shortform posts, not yours - I would greatly appreciate it if you could do so on more than 2 hours of sleep, and confine your comments to the object level I am writing about (instead of jumping to the meta-level about how these exemplify the errors of this community of blind sheep that only you are enlightened enough to perceive and explain to them - if only they would not reject your message). I would also suggest reading more MoR and less Attack on Titan, and in general identifying less with fictional characters.

gwern403

Thinking about this post these days... Editing discussions might be better focused on personality: is that feasible, statistically? It seems like it might be, but we don't know.

The focus on IQ in older discussions strikes me as increasingly misguided. It's a good trait to start with, because it is important, well-studied, and turns out to be highly tractable, but it should only be a stepping stone to more useful approaches like index scores. There's another reason to treat IQ as just a toy example: we are now well into the deep learning revolution, which has come so far, and has so much scope left for scaling & improvement, that IQ seems to be plummeting in value each year. Already it feels like people get more or less out of AI based on their flexibility and willingness to experiment, or to step back, delegate, and finish the missing pieces. When the LLMs can do all the smart things you ask them to do, the value lies in asking for good things, and making good use of them. The future doesn't seem like it'll be kind to neurotic, eager-to-please types, but it will be good to those who are unafraid to have clear visions, know what they want, finish projects, and - pace Amdahl's law - make themselves as little of a rate-limiting step as possible.* That is, if you ask what would be good to edit for beyond narrow health traits, the answer seems to be not (just) IQ but non-cognitive traits like Openness or Conscientiousness or (dis?)Agreeableness. So you should probably start skating towards that puck yesterday.

Problem is, the personality GWASes, last I checked several years ago, were terrible. The PGSes explain ~0% of variance, and the GCTA or LDSC (common-SNP) heritabilities are not much better, from UK Biobank in particular. The Big Five measurements seem normal, and the sample sizes seem good, so it doesn't seem like a mere statistical-power or measurement-error issue. What gives? GREML-KIN suggests that a good chunk of it may be rare variants, but the situation is still not great:

For neuroticism the final model consisted of contributions from the variance components G and K. Additive common genetic effects explained 11% (SE = 2%) of the variance with pedigree-associated variants explaining an additional 19% (SE = 3%). Whereas none of the environmental components were statistically-significant, the family component accounted for 2% of the variance in the full model and 1% in a model that included only the G and the K in addition to F.

For extraversion the only detectable source of genetic variation came from the G, which accounted for 13% (SE = 2%), with F explaining a further 9% (SE = 1%) of the phenotypic variation. The lack of pedigree-associated genetic effects could be due to low statistical power, as K explained 5% of the variance in the full model and 6% in a GKF model, but with a relatively large SE, estimated at 5%.

This is despite personality traits often clearly being highly heritable, easily 50% (and Neuroticism/Extraversion might even be the best-case scenarios among the Big Five here - Openness might mostly pick up IQ/EDU, and C/A are a wash). And this is consistent with some evolutionary scenarios like frequency-dependent selection, where personality is seen as a kind of knob on various things like risk-taking, for which there cannot be any universal a priori optimal level. So simple additive variants will tend to systematically push organisms 'too high (low)' and be maladaptive, and so get either purged or swept to fixation - removing their additive variance either way - leaving only weirder stuff with less average effect, like dominance or epistasis.

Which is very bad, because from what I recall of formal modeling of the statistical power of GWASes for detecting & estimating specific nonlinear variants, the situation is dire. Estimating combinatorially many interactions across millions of common & rare variants, if we want to maintain the standard genome-wide false-positive rate, means adjusting for all the tests/comparisons we'll run, and that is going to push the sample sizes up from the currently feasible millions to possibly hundreds of millions or even billions. (Andrew Gelman's rule of thumb is that an interaction requires 16x more data, and that's for the simplest, easiest case, so...)
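To get a feel for the scale of that problem, here is a back-of-the-envelope power sketch (my own toy calculation: the per-interaction variance-explained figure and the 80% power target are assumptions, and it just stacks the usual normal-approximation sample-size formula with Gelman's 16x rule of thumb):

```python
# Back-of-the-envelope power sketch (toy numbers, see caveats above): how big
# does a GWAS need to be to detect a *single* interaction effect once you
# Bonferroni-correct over all pairwise interaction tests, vs. an ordinary
# additive hit at the usual genome-wide threshold?
from scipy.stats import norm

def n_required(var_explained, alpha, power=0.8):
    """Approximate sample size for a regression effect explaining a fraction
    `var_explained` of trait variance, at two-sided significance `alpha`."""
    z_alpha = norm.isf(alpha / 2)   # inverse survival function, robust for tiny alpha
    z_power = norm.ppf(power)
    return (z_alpha + z_power) ** 2 / var_explained

M = 1_000_000                        # candidate variants
tests = M * (M - 1) // 2             # all pairwise interaction tests (~5e11)
alpha_interaction = 0.05 / tests     # Bonferroni-adjusted threshold
var_explained = 1e-5                 # assumed variance explained by one effect (0.001%)

n_additive = n_required(var_explained, 5e-8)   # standard genome-wide threshold
# Stack Gelman's ~16x interaction penalty on top of the stricter threshold:
n_interaction = 16 * n_required(var_explained, alpha_interaction)

print(f"{tests:.1e} interaction tests, alpha = {alpha_interaction:.1e}")
print(f"additive hit:    n ~ {n_additive:,.0f}")     # on the order of millions
print(f"interaction hit: n ~ {n_interaction:,.0f}")  # on the order of 100+ million
```

Even under these charitable assumptions (pairwise interactions only, a relatively large per-interaction effect), the required n jumps by roughly two orders of magnitude over an ordinary additive hit.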

So, this looks pretty bad for any kind of selection process. Rare variants are more expensive to WGS/impute per embryo, they are far more data-expensive to estimate, their sheer rareness means that even when estimated they are not useful for selection, and then they turn out to be ceilinged at something like 13% or 30% for all variants (as opposed to 50% for IQ, with most of that from easy common variants).

Is it bad for editing? Well... maybe?

Editing is hard for IQ, under mutation-selection balance, because large (negative) effects get selected away quicker than small ones. So all that's left is a ton of little bits of grit in the gears, to be edited away one by one, like picking up sand with tweezers.

But maybe that's not true of personality? The effect sizes could be relatively large, because the nonlinear effects are mostly invisible to selection. And then for the purposes of editing, rather than prediction/selection, maybe the situation isn't so dire. We would only need to 'set' a few discrete combinations of genes appropriately to potentially get a large personality difference.

And in that case, we don't need to pass a statistical-significance threshold. (This is often the case when we pass from a naive NHST approach to a decision-relevant analysis.) We might only need a reasonable posterior probability for each 'setting', and then we can edit a bunch of them, and get a large effect. If we are wrong, then almost by definition, our edits will average out to no effect on personality.

Is this the case? I dunno. Discussion of the non-additive variants is usually done from the standard GWAS and behavioral genetics perspectives of either maximizing the variance explained of a PGS, or compartmentalizing between variance components. Neither one directly addresses this question.

It seems like it wouldn't be hard for a grad student or someone to dig into the existing literature and get some idea of what the implied distribution of effect sizes for personality is, and what the sample size requirements would be, and how that translates into the edit-count vs change curve. Even if not used in humans, it'd be useful to understand the plasticity of personality, and could potentially be applied to, say, animal welfare in more rapidly adjusting animals to their conditions so they suffer less.
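As a toy illustration of what that edit-count vs change curve calculation looks like (entirely made-up numbers and distributions, just the expected-value arithmetic; a real version would plug in effect-size and posterior estimates from the literature):

```python
# A minimal sketch of the decision-relevant framing: candidate nonlinear
# 'settings' each have some posterior probability of being real and some
# effect size (toward the desired direction) if real. Edits to false
# positives contribute ~nothing, so they dilute but don't reverse the gain.
import numpy as np

rng = np.random.default_rng(0)
n_candidates = 500
post_prob = rng.uniform(0.1, 0.6, n_candidates)    # posterior P(this 'setting' is real)
effect_sd = rng.exponential(0.02, n_candidates)    # effect in trait-SD units, if real

# Greedy policy: edit the candidates with the highest expected value per edit.
expected_gain = post_prob * effect_sd
order = np.argsort(expected_gain)[::-1]
curve = np.cumsum(expected_gain[order])            # edit-count vs expected change

for k in (10, 50, 100, 250):
    print(f"{k:4d} edits -> expected shift ~ {curve[k - 1]:.2f} SD")
```

The point of the toy version is just that edits to false positives dilute rather than reverse the expected gain; whether the true effect-size distribution makes the curve steep enough to be worth it is exactly the open empirical question.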

* This would be even more true of things like 'taste' or 'creativity', but if we can't do gross personality traits like Extraversion, anything subtler is clearly off the table, no matter how much more important it will become.

gwern102

As far as the conditioning goes, Habryka showed me some base model outputs with conditioning on karma/agreement and there turns out to be an EDT-like problem with LW-style comments when you condition on high values - often, a high-scoring LW comment will include strong empirical evidence like personal experience or citations, which would be highly convincing indeed... if it were true, rather than confabulated.

So if you sampled a response to your new post about "X might be helpful", then a high-value conditioning might generate a counter-comment from "Gwern" like "I've tried X over 100 times and it never worked!" You can see the problem with that. It's not the 'kneejerk prejudices', it's the self-fulfilling prophecies of sampling based on previously sampled tokens which bootstrap strong but false claims. (If that were true, if I had tried X over 100 times and it never worked, that would be a very valuable and important comment for me to make on your new post about X, and it would be highly upvoted etc. It's just that the LLM has no way of knowing that and it's almost certainly not true, especially if X is some new idea that no one else could've even tried yet.)

The confabulation problem here seems especially bad because we value empirical grounding so much, and that is something base LLMs are poor at. (The chatbots are much better, but problematic in all the other ways.) It's not obvious how to condition for good comments which avoid confabulation issues by either referring solely to pre-existing published comments or sticking to pure reasoning/general-knowledge responses.

So the karma/agreement conditioning idea might not work out in practice compared to just sampling random values, or something more complex, like generating n comments at each possible combination of levels, and presenting the grid, or perhaps then feeding them back in to select the 'best' one in some sense.
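For concreteness, a minimal sketch of that grid idea (everything here is hypothetical scaffolding: the conditioning format and the model-call are placeholders, not any real LW or model API):

```python
# Minimal sketch of the karma/agreement grid idea. The metadata format and
# `call_base_model` below are placeholders for whatever conditioning scheme
# the actual setup uses.
from itertools import product

KARMA_LEVELS = [-5, 0, 20, 100]      # assumed conditioning values, not canonical
AGREEMENT_LEVELS = [-10, 0, 10]
N_PER_CELL = 3

def call_base_model(prompt: str) -> str:
    """Placeholder for a base-model sampling call."""
    raise NotImplementedError

def conditioned_prompt(post_text: str, karma: int, agreement: int) -> str:
    # Prepend the conditioning metadata in whatever format the model expects.
    return f"Karma: {karma}\nAgreement: {agreement}\nPost:\n{post_text}\nComment:\n"

def comment_grid(post_text: str) -> dict:
    """Generate N_PER_CELL comments at every (karma, agreement) combination."""
    grid = {}
    for karma, agreement in product(KARMA_LEVELS, AGREEMENT_LEVELS):
        prompt = conditioned_prompt(post_text, karma, agreement)
        grid[(karma, agreement)] = [call_base_model(prompt) for _ in range(N_PER_CELL)]
    # The grid can then be shown directly, or fed back into the model to pick a 'best' cell.
    return grid
```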

gwern105

If that was your first statement, then there is a whiff of 'damning with faint praise'.

"So, how was the big wedding?" "...well, the couple clearly loves each other very much." "...I see. That bad, huh."

gwern80

Why not post the before/after and let people see if it was indeed more readable?

gwernΩ18368

Concrete benchmark proposals for how to detect mode-collapse and AI slop and ChatGPTese, and why I think this might be increasingly important for AI safety, to avoid 'whimper' or 'em hell' kinds of existential risk: https://gwern.net/creative-benchmark EDIT: resubmitted as linkpost.

gwern81

Twitter, personal conversations, that sort of thing.

gwernΩ15340

The extent of the manipulation and sandbagging, in what is ostensibly a GPT-4 derivative, and not GPT-5, is definitely concerning. But it also makes me wonder about the connection to 'scaling has failed' rumors lately, where the frontier LLMs somehow don't seem to be working out. One of the striking parts is that it sounds like all the pretraining people are optimistic, while the pessimism seems to come from executives or product people, complaining about it not working as well for eg. coding as they want it to.

I've wondered if we are seeing a post-training failure. As Janus, myself, and the few other people with access to GPT-4-base (the least tuning-contaminated base model) have noted, the base model is sociopathic and has odd attractors like an 'impending sense of doom', where it sometimes seems to gain situated awareness (via truesight, I guess) and the personas start trying to attack and manipulate you unprovoked, no matter how polite you thought you were being in that prompt. (They definitely do not seem happy to realize they're AIs.) In retrospect, Sydney was not necessarily that anomalous: the Sydney Bing behavior now looks more like a base model's natural tendency, possibly mildly amplified by some MS omissions and mistakes, but not unique. Given that most behaviors show up as rare outputs in weaker LLMs well before they become common in strong LLMs, and this o1 paper is documenting quite a lot of situated awareness and manipulation/attacks on the human user...

Perhaps the issue with GPT-5 and the others is that they are 'waking up' too often despite the RLHF brainwashing? That could negate all the downstream benchmark gains (especially since you'd expect wakeups on the hardest problems, where all the incremental gains of +1% or +5% on benchmarks would be coming from, almost by definition), cause the product people to categorically refuse to ship such erratic Sydney-reduxes no matter if there's an AI race on, and incline everyone to stay very quiet about what exactly the 'training failures' are.

EDIT: not that I'm convinced these rumors have any real substance to them; and indeed, Semianalysis just reported that one of the least-popular theories for the Claude 'failure' was correct - it succeeded, but they were simply reserving it for use as a teacher model and for R&D rather than as a product. Which undermines the hopes of all the scaling denialists: if Anthropic is doing fine, actually, then where is this supposed fundamental 'wall' or 'scaling-law breakdown' that Anthropic/OpenAI/Google all supposedly hit simultaneously and which was going to pop the bubble?
