Comments

gwern

But the caveat there is that this is inherently a backwards-looking result:

We consider GPT-4o (OpenAI, 2024), Claude-3.5-Sonnet (Anthropic, 2024), Grok-2 (xAI, 2024), Gemini-1.5-Pro (Google, 2024), and DeepSeek-V3 (DeepSeek-AI, 2024).

So one way to put it would be that people & classifiers are good at detecting mid-2024-era chatbot prose. Unfortunately, somewhere after then, at least OpenAI and Google apparently began to target the problem of ChatGPTese (possibly for different reasons: Altman's push into consumer companion-bots/personalization/social-networking, and Google just mostly ignoring RLHF in favor of capabilities), and the chatbot style seems to have improved substantially. Even the current GPT-4o doesn't sound nearly as 4o-like as it did just back in November 2024. Since mode-collapse/ChatGPTese stuff was never a capabilities problem per se (just look at GPT-3!), but mostly just neglect/apathy on the part of the foundation labs (as I've been pointing out since the beginning), it's not a surprise that it could improve rapidly once they put any effort into fixing it.

Between the continued rapid increase in capabilities, some attention finally being paid to esthetics & prose style, and attackers slowly improving their infrastructure in the obvious ways, I expect that over the course of 2025, detecting prose from a SOTA model is going to get much more difficult. (And this excludes the cumulative effect of humans increasingly writing like ChatGPT.)

gwern

I'm not sure this is a big problem. How much net attrition do you really expect over a decade, say? By which point, who really cares? You will have so much more AI progress and accumulated data (particularly if you've been gradually replacing the lower-level employees and you have an 'automation wave' moving through the organization, where employees increasingly train their automated replacements or their job simply becomes reorganizing the work to enable automation).

To the extent there's much attrition at high levels, it seems to be reduced in considerable part by these very dynamics: as returns to high-level human labor go up, presumably there is less attrition from voluntary retirement or leisure consumption (and if the returns go down, then that implies there is no 'shortage' of people for such high-level positions and so no problem); and as the remaining human work becomes more 'white-collar' and based on difficult-for-AI things like reputation, experience, ownership, or creativity, aging and opportunity costs begin to matter less, reducing another source of attrition.

(Even if AI or robotics is unable to do the 'core' of a job, they can help deal with various obstacles which might prevent a human from doing the job. An elderly manager who might decide to retire in part because they are low-key becoming worried about safely driving to/from the office will no longer think about that when they have a self-driving car or remote working becomes ever more feasible; older managers who might be slipping in their grasp of details or who have 'senior moments' will be able to rely on AI secretaries to catch those or just pause stuff for a while until they're back to normal; elite women might invest more in careers if they have Claude-bot as a trustworthy nanny and chauffeur, etc. One is reminded of President Biden: his staffers were able to work around his issues by doing things like rescheduling or canceling events to avoid exposing him publicly when he was bad; it was only an event that even the POTUS can't arbitrarily schedule, a presidential debate, that punctured the carefully-constructed illusion. Few of those staffers were qualified to be President of the United States, and yet, you don't have to be a good president to observe "sounds like Joe's having a bad day today" and quietly cancel his evening appointments for him so he can get to bed early.)

gwern

Also notable: the big OpenAI reveal today was some sort of better personalization. Instead of the crude 'saved facts' personalization ChatGPT has had for a long time and which has never made much of a difference, they're doing... something. Unclear if it's merely RAG or if they are also doing something interesting like lightweight finetuning. But the GPTs definitely seem to have much better access to your other sessions in the web interface, and as far as I know, few other interfaces with frontier models have tried to do much personalization, so this will be an interesting real-world test at scale of how much simple personalization can help with LLMs (similar to Midjourney's relatively new personalization feature, which I get a lot out of).
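For concreteness, the 'merely RAG' version of such personalization could look something like the minimal sketch below: embed snippets of past sessions, retrieve the most similar ones, and prepend them to the new prompt. The function names and structure are placeholders for illustration, not a description of OpenAI's actual system.

```python
# Toy sketch of RAG-style personalization over past chat sessions.
# The embedding vectors are assumed to come from whatever embedding model
# the interface uses; nothing here reflects OpenAI's real implementation.
import numpy as np

def retrieve_context(query_vec, session_snippets, snippet_vecs, k=5):
    """Return the k past-session snippets most cosine-similar to the current query."""
    sims = snippet_vecs @ query_vec / (
        np.linalg.norm(snippet_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9)
    top = np.argsort(-sims)[:k]
    return [session_snippets[i] for i in top]

def personalized_prompt(user_message, query_vec, session_snippets, snippet_vecs):
    """Prepend retrieved personal context to the user's new message."""
    context = retrieve_context(query_vec, session_snippets, snippet_vecs)
    return ("Relevant facts from earlier sessions:\n- " + "\n- ".join(context)
            + "\n\nUser: " + user_message)
```

Lightweight finetuning would be the heavier-duty alternative to this retrieval step, which is why it would be interesting to know which one they actually shipped.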

gwern

I don't think this is true at all. How do you translate, say, rotating multiple shapes in parallel into text?

At least for multimodal LLMs in the pure-token approach like Gato or DALL-E 1 (and probably GPT-4o and Gemini, although few details have been published), you would be able to do that by generating the tokens which embody an encoded image (or video!) of several shapes, well, rotating in parallel. Then you just look at them.
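A cartoon of what that pure-token setup looks like, in the DALL-E 1/Gato style of quantizing each frame into discrete codes and appending them to the text tokens (the codebook size, grid size, and vocabulary offset here are made-up illustration values):

```python
# Cartoon of the pure-token multimodal approach: each video frame is quantized
# into a grid of discrete codes (as a VQ-VAE-style encoder would produce),
# flattened, and appended to the text tokens. "Several shapes rotating in
# parallel" then just becomes a token sequence the model can generate and a
# decoder can render back into frames you can look at.
import numpy as np

CODEBOOK_SIZE = 8192   # made-up image-code vocabulary size
GRID = 16              # made-up 16x16 latent grid per frame
TEXT_VOCAB = 50_000    # made-up text vocabulary size

def frame_to_tokens(frame_codes: np.ndarray) -> list[int]:
    """Flatten one frame's GRID x GRID code indices into a token list."""
    assert frame_codes.shape == (GRID, GRID)
    return frame_codes.reshape(-1).tolist()

def video_to_sequence(prompt_tokens: list[int], frames: list[np.ndarray]) -> list[int]:
    """Append per-frame image tokens (offset past the text vocab) to the prompt tokens."""
    seq = list(prompt_tokens)
    for frame in frames:
        seq += [TEXT_VOCAB + t for t in frame_to_tokens(frame)]
    return seq
```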

gwern

Pursuit of novelty is not VNM-incoherent. Furthermore, it is an instrumentally convergent drive; power-seeking agents will seek novelty as well, because learning increases power in expectation (see: value of information).

Or to put it another way: any argument which convincingly proves that 'incoherent search processes ultimately outcompete coherent search processes' is also an argument which convinces a VNM agent to harness the superior incoherent search processes instead of the inferior coherent ones.
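As a toy illustration of the value-of-information point above (a standard two-action bet, numbers chosen only for simplicity):

```python
# Toy value-of-information calculation: an agent that can observe the state
# before acting earns more in expectation than one that must commit blind,
# which is the sense in which "learning increases power in expectation".
p_heads = 0.5
payoff = {("bet_heads", "heads"): 1.0, ("bet_heads", "tails"): -1.0,
          ("bet_tails", "heads"): -1.0, ("bet_tails", "tails"): 1.0}
actions = ("bet_heads", "bet_tails")

# Without information: pick the single best fixed action in expectation.
ev_blind = max(p_heads * payoff[(a, "heads")] + (1 - p_heads) * payoff[(a, "tails")]
               for a in actions)                                  # = 0.0

# With information: observe the state first, then pick the best action for it.
ev_informed = (p_heads * max(payoff[(a, "heads")] for a in actions)
               + (1 - p_heads) * max(payoff[(a, "tails")] for a in actions))  # = 1.0

value_of_information = ev_informed - ev_blind                     # = 1.0 > 0
```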

gwern

It's good someone else did it, but it has the same problems as the paper: not updated since May 2024, and limited to open-source base models. So it needs to be started back up, and approximate estimators added in for the API/chatbot models too, before it can start providing a good universal capability benchmark in near-realtime.

gwern

One of the most robust benchmarks of generalized ability, which is extremely easy to update (unlike benchmarks like Humanity's Last Exam), would just be to estimate the pretraining loss (ie. the compression ratio). It's a very consistent finding for many years now that pretraining loss is just about the best estimate there is of capability. It is also intrinsically robust to overfitting or benchmark cheating, you can automatically collect huge amounts of new data, and it's hard to sandbag (indeed, how could a model know if it's in pretraining or testing if you are simply handing it a new datapoint scraped from the Internet using the same pipeline and measuring a log-prob?).

I'm a little surprised that we don't already have a continual compression benchmark anywhere. It seems a lot easier to do than most of the benchmarks out there. And while you may object that the most important LLMs to benchmark either don't provide log-probs or the log-probs are meaningless, there are multiple ways to elicit token predictions which would let you estimate log-probs, and then you just need to sample enough to estimate the net BPC to satisfactory precision: https://www.reddit.com/r/mlscaling/comments/1ju1q2e/compression_represents_intelligence_linearly/mlyypmx/
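A minimal sketch of the estimator such a benchmark needs, assuming you have some way to get per-token log-probs for a freshly scraped document (`get_token_logprobs` below is a hypothetical stand-in for that step; for chatbot APIs without log-probs, it would itself have to be an approximation via sampling, as in the linked comment):

```python
# Minimal sketch: convert per-token log-probs on fresh Internet documents
# into a bits-per-byte compression score, the quantity proposed above as a
# continually-updatable capability benchmark. Lower is better.
import math
from typing import Callable, Iterable, List

def bits_per_byte(text: str, token_logprobs: Iterable[float]) -> float:
    """Idealized arithmetic-coded size of `text` under the model, per UTF-8 byte."""
    total_nats = -sum(token_logprobs)              # cross-entropy in nats
    total_bits = total_nats / math.log(2)          # nats -> bits
    return total_bits / len(text.encode("utf-8"))  # normalize by raw document size

def rolling_benchmark(docs: List[str],
                      get_token_logprobs: Callable[[str], List[float]]) -> float:
    """Average bits-per-byte over a batch of newly scraped documents."""
    scores = [bits_per_byte(d, get_token_logprobs(d)) for d in docs]
    return sum(scores) / len(scores)
```

Because the documents are scraped after the model's training cutoff and scored with the same pipeline as everything else, there is nothing for the model to memorize or sandbag against; you just need enough documents for the average to converge.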

gwern

No, it would probably be a mix of "all of the above". FB is buying data from the same places everyone else does, like Scale (which we know from anecdotes like when Scale delivered FB a bunch of blatantly-ChatGPT-written 'human rating data' and FB was displeased), and was using datasets like books3 that are reasonable quality. The reported hardware efficiency numbers have never been impressive, they haven't really innovated in architecture or training method (even the co-distillation for Llama-4 is not new, eg. ERNIE was doing that like 3 years ago), and insider rumors/gossip don't indicate good things about the quality of the research culture. (It's a stark contrast to things like Jeff Dean overseeing a big overhaul to ensure bit-identical reproducibility of runs and Google apparently getting multi-datacenter training working by emphasizing TPU interconnect.) So my guess is that if it's bad, it's not any one single thing like 'we trained for too few tokens' or 'some of our purchased data was shite': it's just everything in the pipeline being a bit mediocre and it multiplying out to a bad end-product which is less than the sum of its parts.

Remember Karpathy's warning: "neural nets want to work". You can screw things up and the neural nets will still work; they will just be 1% worse than they should be. If you don't have a research culture which is rigorous about methodology, or where people just have good enough taste/intuition to always do the right thing, you'll settle for whatever seems to work... (Especially if you are not going above and beyond to ensure your metrics aren't fooling you.) Now apply a 1% penalty to everything, from architecture to compute throughput to data quality to hyperparameters to debugging implementation issues, and you wind up with a model which is already obsolete on release, with no place on the Pareto frontier, and so gets 0% use.
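To make the compounding concrete, a back-of-the-envelope sketch (the 1% figure and the particular list of stages are purely illustrative):

```python
# Illustrative only: compound a small per-stage penalty across the pipeline.
stages = ["architecture", "compute throughput", "data quality",
          "hyperparameters", "implementation debugging"]
penalty = 0.01  # "1% worse than it should be" at each stage

effective_quality = (1 - penalty) ** len(stages)
print(f"net quality vs. a rigorous lab: {effective_quality:.3f}")  # ~0.951
```

A few percent sounds small, but on a crowded Pareto frontier it is the difference between being someone's best option and being no one's.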

gwern

That would be tricky, because you are comparing apples and oranges. Consider that for the USA, there are only 11 cardinals (of 252 worldwide), while there are 10x more federal senators at any moment (I don't know if there would be more or fewer in total: senators tend to be much younger, but cardinals also tend to be long-lived), and I can't even guess how many 'Fortune 500 C-level employees' there might be, given corporate turnover and the size of many 'C-suites' - tens of thousands, maybe? So your suggestions span ~1-3 orders of magnitude less selectivity than cardinals do.
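Rough numbers behind that orders-of-magnitude claim (the C-suite count is just the loose guess from above, not a real figure):

```python
# Back-of-the-envelope selectivity comparison; the C-suite count is a guess.
us_cardinals = 11
us_senators = 100
fortune500_c_suite = 30_000  # "tens of thousands, maybe?"

print(us_senators / us_cardinals)          # ~9x larger pool: ~1 order of magnitude
print(fortune500_c_suite / us_cardinals)   # ~3,000x larger pool: ~3 orders of magnitude
```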

gwern

whomever makes it into the college of cardinals.

I would be surprised if that was the primary homosexuality-enriching step, given that reporting has always been that quite a lot of low-level parish-level priests are also gay. (Note, for example, how many of the sexual abuse scandal victims were boys/men.) I would guess that it operates fairly steadily at all levels, starting from simply which young boys opt for the priesthood (known to be a demanding and difficult occupation even if the celibacy requirement is, for you, not so onerous) and operating from there; if I had to guess where the biggest enrichment is, it'd be at the 'leaving your country for the Vatican' step, given how notoriously gay the Vatican is. So going there suggests either that you are gay (and so the buggery isn't a bug, it's a feature) or you are highly ambitious and don't mind it (or are willing to exploit it and again, not a bug but a feature).
