Comments

gwern (41)

If an eval is mandated by law, then it will be run even if it required logprobs.

I won't hold my breath.

I think commercial companies often would open up raw logprobs, but there's not much demand, the 'logprobs' of tuned models are not really logprobs anyway, and the real problem is that the leading model owners won't do so - and those are the important ones to benchmark. I have little interest in the creativity of random little Llama finetunes no one uses.

gwern (62)

but if trained well such models' idea of aesthetic quality is at least pretty close to most human judgements

That does not follow. Preference learning involves almost no learning of preferences. A suit cut to fit all may wind up fitting none - particularly for high-dimensional things under heavy optimization, like, say, aesthetics, where you want to apply a lot of selection pressure to get samples which are easily 1-in-10,000 or rarer, and so 'the tails come apart'.
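To make the 'tails come apart' point concrete, here is a toy simulation (all numbers invented; assume a proxy reward model whose scores correlate 0.8 with 'true' quality): selecting the 1-in-10,000 best sample by the proxy gets you something well short of the 1-in-10,000 best by the true standard.

```python
# Toy illustration, not real preference data: a proxy score correlated r=0.8
# with 'true' quality still diverges badly once you select the best of 10,000.
import numpy as np

rng = np.random.default_rng(0)
r = 0.8                                   # assumed proxy/true correlation
n_samples, n_trials = 10_000, 200

true_q = rng.standard_normal((n_trials, n_samples))
proxy = r * true_q + np.sqrt(1 - r**2) * rng.standard_normal((n_trials, n_samples))

picked = true_q[np.arange(n_trials), proxy.argmax(axis=1)]   # best by proxy
best = true_q.max(axis=1)                                    # best by true quality
print(f"true quality of proxy-selected sample:  {picked.mean():.2f}")
print(f"true quality of the actual best sample: {best.mean():.2f}")
# The gap is the selection pressure spent optimizing the proxy's error term.
```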

(How much variance is explained by individual differences in preference-learning settings like comparing image generators? A great question! And you'll find that hardly anyone has any idea. As it happens, I asked the developer of a major new image generator this exact question last night, and not only did he have no idea, it looked like it had never even occurred to him to wonder what the performance ceiling without personalization could be, to what extent all of the expensive ratings they were paying for reflected individual rater preferences rather than some 'objective' quality, or whether they were even properly preserving such metadata rather than, as it seems many tuning datasets do, throwing it out as 'unnecessary'.)
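As a sketch of what answering that variance question would even look like (made-up numbers, not any real rating dataset): decompose an items-by-raters rating table into a shared item effect, a rater leniency effect, and the rater-by-item 'taste' residual that only personalization could capture.

```python
# Toy variance decomposition of an items x raters rating table (simulated data;
# the variance components below are assumptions, not estimates from anywhere).
import numpy as np

rng = np.random.default_rng(0)
n_items, n_raters = 500, 50
sd_item, sd_rater, sd_taste, sd_noise = 1.0, 0.5, 1.0, 0.5   # assumed

item = rng.normal(0, sd_item, (n_items, 1))            # shared 'objective' quality
rater = rng.normal(0, sd_rater, (1, n_raters))          # rater leniency
taste = rng.normal(0, sd_taste, (n_items, n_raters))    # individual preferences
ratings = item + rater + taste + rng.normal(0, sd_noise, (n_items, n_raters))

# Crude method-of-moments decomposition from the observed table:
grand = ratings.mean()
item_means = ratings.mean(axis=1, keepdims=True)
rater_means = ratings.mean(axis=0, keepdims=True)
resid = ratings - item_means - rater_means + grand

shares = np.array([item_means.var(), rater_means.var(), resid.var()])
shares /= shares.sum()
print(f"'objective' item share: {shares[0]:.0%}")
print(f"rater leniency share:   {shares[1]:.0%}")
print(f"taste + noise share (needs repeated ratings to split): {shares[2]:.0%}")
```

If the last share is large relative to the first, most of what you are paying raters for is individual preference, and a single 'quality' model is leaving that on the table.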

but if trained well such models' idea of aesthetic quality is at least pretty close to most human judgements. ... Then you just need a large number of high quality human judgements from a representative cross-section of people with good taste in poetry/prose/fiction: hiring professional human editors or literary talent scouts seems like a good idea. One of the good things about foundation model sizes and training costs going up is that reasonable budgets for fine-tuning should also increase proportionately.

No. This is fundamentally wrong; it is what is already being done, and it is exactly what I am criticizing. There is no single 'taste' or 'quality'. Individual differences are real.{{citation needed}} People have different preferences.{{citation needed}} No change in the 'cross-section' changes that (unless you reduce the 'people' down to 1 person, the current user). All you are doing is again optimizing for the lowest common denominator. Changing the denominator population doesn't change that.

Seriously, imagine applying this logic anywhere else, like food!
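Or, sticking to generative models, here is the lowest-common-denominator failure in toy form (assumed quadratic preference model, nothing empirical): two taste clusters, one reward model fit to their average, and the single reward-maximizing output is the bland midpoint neither cluster actually wants; changing the cross-section just moves the midpoint around.

```python
# Toy sketch: optimizing one averaged rating over heterogeneous raters.
import numpy as np

styles = np.linspace(-1.5, 1.5, 301)                   # candidate outputs on one style axis
ideals = np.r_[np.full(50, -1.0), np.full(50, 1.0)]     # two taste clusters

def rating(style, ideal):
    return -(style - ideal) ** 2    # assumed: rating falls off with distance from the rater's ideal

mean_rating = np.array([rating(s, ideals).mean() for s in styles])
best_single = styles[mean_rating.argmax()]
print(f"population-optimal single style: {best_single:+.2f}")   # the bland midpoint, ~0.0
print(f"its average rating: {mean_rating.max():.2f}")            # -1.0
print("average rating if each rater got their own ideal: 0.00")
```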

Another option would be to train or fine-tune the quality scoring model used for the RL on literary sources (books, poetry, etc.) with quality labels drawn from relatively objective existing data, like total sales, literary awards, critical rankings, reviews from good reviewers, and so forth. ... So the obvious approach for finer-grained style control would be to train or fine-tune on a training set of a large number of documents, each of which consists of a prompt-like description/review/multiple reviews of a literary work, giving a variety of different types of aesthetic opinions and objective measures of its quality, followed by the corresponding literary work itself.

Conditioning won't change the mode collapse, except insofar as you are smuggling in individuals through the backdoor, such as by developing an implicit model of individual reviewers' preferences. (In which case, far better to just condition on all individuals directly...)

and generally optimizing such things too hard leads to sameness ... The RLHF approach only trains a single aesthetic, and probably shouldn't be taken too far or optimized too hard

Well, yes, that's the problem. It has been taken too far and optimized too hard for a single quality score, and that's where we are now already. How do we provide better benchmarks where optimizing harder won't just worsen the problem?

gwern (61)

I am familiar with Schmidhuber's ideas, yes. But I had to come up with these alternatives because his would not work here, and I'm not sure they work anywhere.

His compression-acceleration metric isn't too useful here, and most forms of 'compression' (or anything involving a likelihood) are not helpful here at all, because you don't have access to anything like that in most cases. For example, ChatGPT doesn't give you the full logits (actually, I'm not sure if they give them at all - I recall OA saying they were planning to expose them again in a very limited fashion, but not whether they actually did), and tuned models don't have logits, they have value estimates, which used to be log-likelihood-related logits but no longer are.

Any diversity/creativity benchmark which can't be run on ChatGPT & Claude & Gemini is dead on arrival and of no interest to me. We don't need numbers from the open-weights models, we need numbers on the models being used the most at the frontier and generating the most tokens worldwide that you'll be reading forever - the closed models, which do not give you such things as logits or whitebox finetuning etc. If it can't be done by calling a standard text completion API, then I ignored it.

I am also doubtful that the compression metrics really work at finite samples or capture what we mean by creativity in generative models. As with all of Schmidhuber's work, he has never gotten it working on more than toy problems (if even that), and when I look at actual compression losses on text, like gzipping passages or the OA Playground highlighting words by their log-likelihood, the high-perplexity tokens or passages bear little resemblance to what I would consider 'interesting' or 'surprising'. (This is related to the question of 'if predicting tokens induces intelligence, and LLMs are now superhuman at predicting random Internet tokens, why are LLMs still not superhumanly intelligent?') People also try running compression metrics on programming-language source code, and you get results like "Javascript is the best programming language", which is... counterintuitive, to say the least. So I am unsure his compression metrics would work without a lot of revising, while my proposed metrics seem a lot less risky and to map more directly onto what creative thinkers want out of generative models.
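To make concrete why I distrust raw compression losses as a creativity signal, here is the crudest possible version of such a metric, using nothing but gzip (the sample texts are my own illustrative choices): random gibberish scores as the most 'novel' of the three, which is exactly the failure mode.

```python
# Crude compression-based 'surprise' metric: gzipped bits per character.
import gzip, random

random.seed(0)

def bits_per_char(text: str) -> float:
    return 8 * len(gzip.compress(text.encode("utf-8"), compresslevel=9)) / len(text)

samples = {
    "repetitive":       "the cat sat on the mat. " * 40,
    "ordinary prose":   ("It was a bright cold day in April, and the clocks were "
                         "striking thirteen. The hallway smelt of boiled cabbage "
                         "and old rag mats. ") * 8,
    "random gibberish": "".join(random.choice("abcdefghijklmnopqrstuvwxyz ,.") for _ in range(960)),
}
for name, text in samples.items():
    print(f"{name:>16}: {bits_per_char(text):.2f} bits/char")
# Gibberish wins on compression 'novelty' despite being worthless: unpredictability
# is not interestingness, and a metric that can't tell them apart will reward noise.
```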

gwern (51)

Yeah, it's definitely something of a deus ex machina gimmick. Tsukasa just plain lost and the logical ending is for him to be stoned or killed - but gosh darn it, they just wanted him around too much and to redeem him somehow, so hey! here's this other thing which has not been foreshadowed or meaningfully written into his characterization or worldbuilding, like, at all. I rolled my eyes when I saw that twist coming. Even given the Dr Stone formula of wildly swerving between 'shonen' and 'Robinsonade', it was poorly done.

(Ryusui would've been a better negation of Tsukasa, but it would have been tricky to make that work. If you are at the point where his naval skills really matter, the Tsukasa war has to be over already, as the exponential cascade should've long passed irreversibility by the time you have the manpower to build sailing ships rather than, say, a dugout canoe.)

gwern (121)

Using LLMs is an intellectual skill. I would be astonished if IQ was not pretty helpful for that.

I don't think it is all that helpful, adjusting for the tasks that people do, after years of watching people use LLMs. Smart people are often too arrogant and proud, and know too much. "It's just a pile of matrix multiplications and a very complicated if-function and therefore can't do anything" is the sort of thing only a smart person can convince themselves of, whereas a dumb person thinking "I ask the smart little man in the magic box my questions and I get answers" is getting more out of it. (The benefits of LLM usage are also highly context-dependent: so you'll find studies showing LLMs assist the highest performers most, but also ones showing they help the lowest most.) Like in 2020: the more you knew about AI, the dumber your uses of GPT-3 were, because you 'knew' that it couldn't do anything, and you had to hold its hand to do everything, and you had to phrase everything in baby talk, etc. You had to unlearn everything you knew and anthropomorphize it to meaningfully explore prompting. This requires a certain flexibility of mind that has less to do with IQ and more to do with, say, schizophrenia - the people in Cyborgism, who do the most interesting things with LLMs, are not extraordinarily intelligent. They are, however, kinda weird and crazy.

gwern (85)

I'm not sure we want something beyond the statistical range of human personality traits

Obviously it is untrue that editing is useless if it 'only' gives you a von Neumann. Similarly for personality. We don't reify sets of personality traits as much as we do IQ, which is the more obvious construct, but there are definitely many people who achieved remarkable things through force of personality. (Figures like Lee Kuan Yew or Napoleon or Elon Musk come to mind: they were smart, and lucky, and made good choices, but there is clearly still a lot left over to explain.) And because personality is many things and there seems to be a pipeline model of output, you quickly get very few people at the tails who assemble all the right components. (Gignac has a paper making this point more explicitly.)

Why not select outliers from the population using personality testing and give them high intelligence?

You're acting like it's uncontroversially true that you have unlimited edits and can change any property at any time in development. I don't think that is the case.* There is going to be an editing budget and limits to editing. One might as well ask the opposite question: why not select intelligence outliers from the population and give them high personality traits? (Well, to know you don't want to do that, you would have to have some idea of how well personality editing would work - which we don't. That's my point!)

* Actually, the whole adult thing is a bit of a red herring. I believe even OP has largely abandoned the idea of adult editing and gone back to embryo-based approaches...? This is just a convenient place to drop my comment about uses of editing which will matter more over the next 30 years.

gwern (127)

I think you would probably be downvoted because you have already admitted to writing poorly-thought-out, ignorant comments under conditions conducive to arrogance and bad judgment, of which you are apparently unashamed and feel no need to rectify (eg. by refraining from commenting until you are recovered), while dragging in unrelated claims which are seriously problematic, like uncritical belief in Dunning-Kruger as a thing, or claiming that anyone is touting 'IQ over WAIS' (WAIS... as in, the IQ test WAIS?), or apparently believing in things like multiple intelligences; and your comments are littered with mockery, spelling errors, and grandiose generalizations writing checks that you don't remotely come close to cashing. (Saying you've definitely seen data, trust me bro, but you can't remember where, and everyone should just go google it themselves, is not a convincing argument.)

If you are going to comment on my serious writings - and in my shortform posts, not yours - I would greatly appreciate it if you could do so on more than 2 hours of sleep, and confine your comments to the object level I am writing about (instead of jumping to the meta-level about how these exemplify the errors of this community of blind sheep that only you are enlightened enough to perceive and explain to them - if only they would not reject your message). I would also suggest reading more MoR and less Attack on Titan, and in general identifying less with fictional characters.

gwern (423)

Thinking about this post these days... Editing discussions might be better focused on personality: is that feasible, statistically? It seems like it might be, but we don't know.

The focus on IQ in older discussions strikes me as increasingly misguided. It's a good trait to start with, because it is important, well-studied, and turns out to be highly tractable, but it should only be a stepping stone to more useful approaches like index scores. There's also another reason to treat IQ as just a toy example: we are now well into the deep learning revolution, and it's come so far, and there's so much scope for scaling & improvement, that it seems like IQ is plummeting in value each year. Already it feels like people get less or more out of AI based on their flexibility and willingness to experiment, or to step back & delegate & finish missing pieces. When the LLMs can do all the smart things you ask them to do, the value lies in asking for good ones, and making good use of them. The future doesn't seem like it'll be kind to neurotic, eager-to-please types, but good to those who are unafraid to have clear visions or know what they want, who finish projects, and who - pace Amdahl's law - make themselves as little of a rate-limiting step as possible.* That is, if you ask what would be good to edit for, beyond narrow health traits, it seems like the answer is not (just) IQ but non-cognitive traits like Openness or Conscientiousness or (dis?)Agreeableness. So, you should probably start skating towards that puck yesterday.

Problem is, the personality GWASes, last I checked several years ago, were terrible. The PGS % is ~0%, and the GCTAs or LDSC (common-SNP heritabilities) are not much better, from UK Biobank in particular. The measurements of the Big Five seem normal, and the sample sizes seem good, so it doesn't seem like a mere statistical-power or measurement-error issue. What gives? GREML-KIN suggests that a good chunk of it may be rare variants, but the situation is still not great:

For neuroticism the final model consisted of contributions from the variance components G and K. Additive common genetic effects explained 11% (SE = 2%) of the variance with pedigree-associated variants explaining an additional 19% (SE = 3%). Whereas none of the environmental components were statistically-significant, the family component accounted for 2% of the variance in the full model and 1% in a model that included only the G and the K in addition to F.

For extraversion the only detectable source of genetic variation came from the G, which accounted for 13% (SE = 2%), with F explaining a further 9% (SE = 1%) of the phenotypic variation. The lack of pedigree-associated genetic effects could be due to low statistical power, as K explained 5% of the variance in the full model and 6% in a GKF model, but with a relatively large SE, estimated at 5%.

This is despite personality traits often clearly being highly heritable, easily 50% (and Neuroticism/Extraversion might even be the best-case scenarios for the Big Five here - Openness might pick up mostly IQ/EDU, and C/A are a wash). And this is consistent with some evolutionary scenarios like frequency-dependent selection, where personality is seen as a kind of knob on various things like risk-taking, where there cannot be any kind of universal a priori optimal level of risk-taking. So simple additive variants will tend to systematically push organisms 'too high (low)', be maladaptive, and get selected to fixation or elimination, leaving only weirder stuff which has less average effect, like dominance or epistasis. Which is very bad, because from what I recall of formal modeling of the statistical power of GWASes for detecting & estimating specific nonlinear variants, the situation is dire. Estimating combinatorially many interactions across millions of common & rare variants, if we want to maintain the standard genome-wide false-positive rate, means that we will have to adjust for all the tests/comparisons we'll run, and that is going to push the sample sizes up from the current feasible millions to possibly hundreds of millions or even billions. (Andrew Gelman's rule of thumb is that an interaction requires 16x more data, and that's for the simplest, easiest case, so...)
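For a sense of scale, a back-of-the-envelope calculation (all inputs here are my assumptions, not figures from any particular paper) of what exhaustively testing pairwise interactions would cost:

```python
# Back-of-the-envelope: multiple-testing burden for a pairwise interaction scan.
from math import comb

m_variants = 1_000_000            # assumed: common variants in a typical GWAS panel
pairs = comb(m_variants, 2)       # every pairwise interaction test
alpha = 0.05 / pairs              # Bonferroni at the usual genome-wide 5% level
print(f"pairwise tests:  {pairs:.1e}")
print(f"per-test alpha:  {alpha:.1e}  (vs the usual additive threshold of 5e-8)")

# Gelman's 16x rule of thumb: detecting an interaction half the size of a main
# effect needs ~16x the sample - before the far stricter significance threshold.
n_additive = 1_000_000            # assumed: sample size where the additive GWAS already works
print(f"sample for comparable power on one interaction: ~{16 * n_additive:,}")
```

And even this understates the problem, since it ignores higher-order interactions and rare variants entirely.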

So, this looks pretty bad for any kind of selection process. Rare variants are more expensive to WGS/impute per embryo, they are far more data-expensive to estimate, the sheer rareness means that even when estimated they are not useful for selection, and then they turn out to be ceilinged at like 13% or 30% for all variants (as opposed to 50% for IQ, with most of that from easy common variants).

Is it bad for editing? Well... maybe?

Editing is hard for IQ, under mutation-selection balance, because large (negative) effects get selected away quicker than small ones. So all that's left is a ton of little bits of grit in the gears, to be edited away one by one, like picking up sand with tweezers.

But maybe that's not true of personality? The effect sizes could be relatively large, because the nonlinear effects are mostly invisible to selection. And then for the purposes of editing, rather than prediction/selection, maybe the situation isn't so dire. We would only need to 'set' a few discrete combinations of genes appropriately to potentially get a large personality difference.

And in that case, we don't need to pass a statistical-significance threshold. (This is often the case when we pass from a naive NHST approach to a decision-relevant analysis.) We might only need a reasonable posterior probability for each 'setting', and then we can edit a bunch of them, and get a large effect. If we are wrong, then almost by definition, our edits will average out to no effect on personality.
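As a toy version of that decision analysis (every number below is an assumption pulled out of the air, purely to show the shape of the calculation): edit every candidate variant that clears a modest posterior probability of being real, and the false positives mostly wash out while the hits add up.

```python
# Toy expected-value sketch for 'edit on posterior probability, not significance'.
import numpy as np

rng = np.random.default_rng(0)
n_edits = 100        # assumed editing budget
p_real = 0.3         # assumed posterior probability each candidate effect is real
effect_sd = 0.02     # assumed per-edit effect on the trait, in SD units

def one_run():
    real = rng.random(n_edits) < p_real
    return (real * effect_sd).sum()   # false positives contribute ~nothing

shifts = np.array([one_run() for _ in range(10_000)])
print(f"mean shift: {shifts.mean():.2f} SD  (analytic: {n_edits * p_real * effect_sd:.2f} SD)")
print(f"5th-95th percentile: {np.quantile(shifts, [0.05, 0.95]).round(2)}")
```

The interesting empirical question is whether the personality literature pins down anything like p_real and effect_sd well enough to make that expected value non-trivial; that is the digging-through-the-literature project below.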

Is this the case? I dunno. Discussion of the non-additive variants is usually done from the standard GWAS and behavioral genetics perspectives of either maximizing the variance explained of a PGS, or compartmentalizing between variance components. Neither one directly addresses this question.

It seems like it wouldn't be hard for a grad student or someone to dig into the existing literature and get some idea of what the implied distribution of effect sizes for personality is, and what the sample size requirements would be, and how that translates into the edit-count vs change curve. Even if not used in humans, it'd be useful to understand the plasticity of personality, and could potentially be applied to, say, animal welfare in more rapidly adjusting animals to their conditions so they suffer less.

* This would be even more true of things like 'taste' or 'creativity', but if we can't do gross personality traits like Extraversion, anything subtler is clearly off the table, no matter how much more important it will become.

gwern (102)

As far as the conditioning goes, Habryka showed me some base model outputs with conditioning on karma/agreement and there turns out to be an EDT-like problem with LW-style comments when you condition on high values - often, a high-scoring LW comment will include strong empirical evidence like personal experience or citations, which would be highly convincing indeed... if it were true, rather than confabulated.

So if you sampled a response to your new post about "X might be helpful", then a high-value conditioning might generate a counter-comment from "Gwern" like "I've tried X over 100 times and it never worked!" You can see the problem with that. It's not the 'kneejerk prejudices', it's the self-fulfilling prophecies of sampling based on previously sampled tokens which bootstrap strong but false claims. (If that were true, if I had tried X over 100 times and it never worked, that would be a very valuable and important comment for me to make on your new post about X, and it would be highly upvoted etc. It's just that the LLM has no way of knowing that and it's almost certainly not true, especially if X is some new idea that no one else could've even tried yet.)

The confabulation problem here seems especially bad because we value empirical grounding so much, and that is something base LLMs are poor at. (The chatbots are much better, but problematic in all the other ways.) It's not obvious how to condition for good comments which avoid confabulation issues and either solely refer to pre-existing published material or stick to pure reasoning/general-knowledge responses.

So the karma/agreement conditioning idea might not work out in practice compared to just sampling random values, or something more complex, like generating n comments at each possible combination of levels, and presenting the grid, or perhaps then feeding them back in to select the 'best' one in some sense.
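As a sketch of what that grid version might look like in code (the `complete()` callable and the metadata-header format here are stand-ins I made up for illustration, not the actual conditioning format used in the real setup):

```python
# Illustrative only: sample comments at every (karma, agreement) combination and
# lay them out as a grid for a human (or a second model) to choose from.
from itertools import product

KARMA_LEVELS = [-10, 0, 25, 100]       # assumed conditioning levels
AGREEMENT_LEVELS = [-10, 0, 10]

def build_prompt(post_text: str, karma: int, agreement: int) -> str:
    header = f"Karma: {karma}\nAgreement: {agreement}\nCommenter: Gwern\n\n"
    return header + post_text + "\n\nComment:"

def comment_grid(post_text: str, complete, n_per_cell: int = 3) -> dict:
    """Return {(karma, agreement): [n_per_cell sampled comments]}."""
    return {
        (k, a): [complete(build_prompt(post_text, k, a)) for _ in range(n_per_cell)]
        for k, a in product(KARMA_LEVELS, AGREEMENT_LEVELS)
    }

# Usage with a dummy completer, since no particular base-model API is assumed:
if __name__ == "__main__":
    dummy = lambda prompt: f"[comment conditioned on {prompt.splitlines()[0]}, {prompt.splitlines()[1]}]"
    for cell, comments in comment_grid("X might be helpful because...", dummy, 1).items():
        print(cell, comments[0])
```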

gwern (105)

If that was your first statement, then there is a whiff of 'damning with faint praise'.

"So, how was the big wedding?" "...well, the couple clearly loves each other very much." "...I see. That bad, huh."
