Comments

Also I suspect that there is some astronomically high k such that monkeys at a keyboard (i.e. "output random tokens") will outperform base models for some tasks by the pass@k metric.

It would be an extreme bias-variance tradeoff, yes.
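
To make that concrete with a toy calculation (my own sketch, not from either comment): for any sampler whose attempts are i.i.d. with per-sample success probability p, pass@k = 1 - (1 - p)^k, so even an absurdly small p approaches 100% at some astronomically high k, while a model that is confidently wrong on a task stays pinned at 0 no matter how many samples you draw.

```python
# Toy illustration (not from the comments above): pass@k for an i.i.d. sampler
# with per-sample success probability p. Any nonzero p eventually wins at a
# sufficiently astronomical k; a deterministically-wrong model never does.
def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent samples succeeds."""
    return 1.0 - (1.0 - p) ** k

monkey_p = 1e-9  # hypothetical per-sample success rate of "output random tokens"
for k in (10, 10**6, 10**10):
    print(f"k={k:>14,}  pass@k={pass_at_k(monkey_p, k):.4f}")
```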

This has been a consistent weakness of OpenAI's image processing from the start: GPT-4-V came with clear-cut warnings against using it on non-photographic inputs like screenshots or documents or tables, and sure enough, I found that it was wildly inaccurate on web page screenshots.

(In particular, I had been hoping to use it to automate Gwern.net regression detection: use a headless browser to screenshot random points in Gwern.net and report back if anything looked 'wrong'. It seemed like the sort of 'I know it when I see it' judgment task a VLM ought to be perfectly suited for. But I discovered when trying it out that GPT-4-V basically couldn't see even blatant errors like broken dropcaps, and it would burn a lot of money to generate mostly just false positives.)
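
(For illustration only, a minimal sketch of that kind of pipeline, assuming Playwright for the headless browser and the OpenAI chat completions API for the VLM judgment; the model name, prompt, and URL are placeholders, not the actual tooling described above:)

```python
# Illustrative sketch only: screenshot a page headlessly and ask a VLM whether
# anything looks visually broken. Assumes `pip install playwright openai` and
# an OPENAI_API_KEY in the environment; model name and prompt are placeholders.
import base64
from playwright.sync_api import sync_playwright
from openai import OpenAI

def screenshot(url: str, path: str = "page.png") -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 1024})
        page.goto(url)
        page.screenshot(path=path)
        browser.close()
    return path

def looks_broken(image_path: str) -> str:
    client = OpenAI()
    b64 = base64.b64encode(open(image_path, "rb").read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder: any vision-capable model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Does this web page screenshot show any visual regressions "
                         "(broken dropcaps, overlapping text, missing images)? "
                         "Answer YES or NO, then explain briefly."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

print(looks_broken(screenshot("https://gwern.net/")))
```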

My guess is that the image datasets are so skewed towards photographs, and the de facto resolution so low, that GUIs/browsers/documents/tables/etc just get turned into garbage. If you ever try turning a screenshot or PDF page into a common image input size, like 224x224px (even a generous 512x512px), you'll notice that often they become impossible to read or understand in isolation, like a VLM would be forced to. The text labels become almost unreadable, and when they are readable, you have to think about it hard for a while - exactly the sort of thing a cheap small VLM isn't allowed to do.
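
(You can see this for yourself in a few lines; a quick sketch using Pillow, where "screenshot.png" is a placeholder for any full-page browser capture:)

```python
# Shrink a page screenshot down to common VLM input resolutions and try to
# read the result yourself. "screenshot.png" is a placeholder path.
from PIL import Image

img = Image.open("screenshot.png")           # e.g. a 1920x1080 browser capture
for size in (224, 512):
    small = img.resize((size, size), Image.LANCZOS)
    small.save(f"screenshot_{size}px.png")   # open these and try reading the labels
```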

This should be highly fixable using autoregressive multimodal LLMs given high-res image encodings and appropriate scale-ups (especially with improved BPE tokenization) but I guess it just hasn't happened & been deployed at scale yet.

In the first case, think of chess; superhuman chess still plays chess. You can watch AlphaZero’s games and nod along—even if it’s alien, you get what it's doing, the structure of the chess "universe" is such that unbounded intelligence still leads to mostly understandable moves.

I guess the question here is how much is 'mostly'? We can point to areas of chess like the endgame databases, which are just plain inscrutable: when the databases play out some mate-in-50 game because that is what is provably optimal by checking every possible move, any human understanding is largely illusory. They are brute facts determined by the totality of the game tree, not any small self-contained explanation like 'knight forking'. (There is probably no 'understandability' even in principle for arbitrarily intelligent agents, similar to asking why the billionth digit of pi or Chaitin's omega is what it is.)

And if we want to expand it out to more realistic settings, we don't even get that: 'chess' doesn't exist in the real world - only specific implementations of chess. With an actual implementation in software, maybe we get something closer to a TAS speedrun, where the chess player twitches for a while and then a buffer overflow instantly wins without the opponent getting to even move a pawn.

But a superintelligence might instead write music that sounds to us like static, full of some brilliant structure, with no ability for human brains to comprehend it. Humans might be unable to tell whether it’s genius or gibberish - but are such heights of genius a real thing? I am unsure.

But what part are you unsure about? There are surely many pragmatic ways to tell if there is a structure in the apparent static (even if it cannot be explained to us the way that, say, cryptographic algorithms can be explained to us and demonstrated by simply decrypting the 'static' into a very meaningful message): for example, simply see if other superintelligences or algorithms can predict/compress the static. You and I can't see 'non-robust features' in images that neural networks do, but we can observe them indirectly by looking at the performance of neural networks dropping when we erase them, and see that they are really real.
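
(As a toy version of that test: any off-the-shelf compressor already separates 'static' that merely looks random from true randomness, with zlib standing in here for the 'other algorithm'; the structured signal is my own made-up example:)

```python
# Toy "is there structure in the static?" test: data with hidden regularity
# compresses far better than true noise, even if both look like static.
import os, zlib

true_noise = os.urandom(100_000)
structured = bytes((i * i) % 256 for i in range(100_000))  # deterministic rule, no entropy

for name, data in [("true noise", true_noise), ("hidden structure", structured)]:
    ratio = len(zlib.compress(data, 9)) / len(data)
    print(f"{name:>16}: compresses to {ratio:.1%} of original size")
```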

What domains of 'real improvement' exist that are uncoupled to human perceptions of improvement, but still downstream of text prediction?

As defined, this is a little paradoxical: how could I convince a human like you to perceive domains of real improvement which humans do not perceive...?

correctly guessing the true authors of anonymous text

See, this is exactly the example I would have given: truesight is an obvious example of a domain of real improvement which appears on no benchmarks I am aware of, but which appears to correlate strongly with the pretraining loss; it is not applied anywhere (I hope); it is unobvious that LLMs might do it, and the capability does not naturally reveal itself in any standard use-cases (which is why people are shocked when it surfaces); it would have been easy for no one to have observed it up until now, or to have dismissed it; and even now, after a lot of publicizing (including by yours truly), only a few weirdos know much about it.

Why can't there be plenty of other things like inner-monologue or truesight? ("Wait, you could do X? Why didn't you tell us?" "You never asked.")

What domains of 'real improvement' exist that are uncoupled to human perceptions of improvement, but still downstream of text prediction?

Maybe a better example would be to point out that 'emergent' tasks in general, particularly multi-step tasks, can have observed success rates of precisely 0 in feasible finite samples, but extreme brute-force sampling reveals hidden scaling. Humans would perceive zero improvement as the models scaled (0/100 = 0%, 0/100 = 0%, 0/100 = 0%...), even though they might be rapidly improving from 1/100,000 to 1/10,000 to 1/1,000 to... etc. "Sampling can show the presence of knowledge but not the absence."
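
(A quick check of that arithmetic, using the hypothetical rates and the n=100 sample size from above:)

```python
# With only n=100 attempts per model, true success rates improving 10x per
# scale-up still produce an observed 0/100 in the vast majority of evals.
n = 100
for true_rate in (1e-5, 1e-4, 1e-3):
    p_at_least_one = 1 - (1 - true_rate) ** n
    print(f"true rate {true_rate:.0e}: P(>=1 success in {n} samples) = {p_at_least_one:.1%}")
```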

I think it's a little more concerning that Dwarkesh has invested in this startup:

Mechanize is backed by investments from Nat Friedman and Daniel Gross, Patrick Collison, Dwarkesh Patel, Jeff Dean, Sholto Douglas, and Marcus Abramovitch.

And I do not see any disclosure of this in either the YouTube description or the Substack transcript at present.

In that brief moment of uncertainty, anything could have happened. If one person had just packed up and left, everyone might have followed suit. But nobody reacted. Perhaps what kept the room still was the fear of being perceived as scared. Or the belief that surely, bad things could not happen to them. Or maybe they’d heard enough false alarms in their lives. I’m not sure.

One of the most depressing things about the Replication Crisis in especially social psychology is that many results from the 1950s and 1960s failed to replicate at all... except the Asch conformity experiments. Those seem to replicate just fine.

At first glance, your linked document seems to match this. The herald who calls the printer "pig-headed" does so in direct connection with calling him "dull", which at least in modern terms would be considered a way of calling him stupid?

Not necessarily. 'Dull' can mean, in 1621 just as well as 2025, plenty of other things: eg "Causing depression or ennui; tedious, uninteresting, uneventful; the reverse of exhilarating or enlivening." (OED example closest in time: "Are my discourses dull? Barren my wit?" --Jonson's good friend & fellow playwright, William Shakespeare, Comedy of Errors (1623)); or, "Of persons, or their mood: Having the natural vivacity or cheerfulness blunted; having the spirits somewhat depressed; listless; in a state approaching gloom, melancholy, or sadness: the opposite of lively or cheerful." (Shakespeare again: "Sweet recreation barr'd, what doth ensue / But moodie and dull melancholly?") Which in the context of a 'dull' tradesman who refuses to hear the exciting news being brought by no less than 2 heralds before he knows 'the price', is sensible enough.

not reading your entire document?

That would certainly help, because if you read the rest of the Printer's rather cynical comments, constantly undermining the heralds, he doesn't sound in the slightest bit like he is supposed to be stupid or retarded - as opposed to a curmudgeonly critic constantly - obstinately, even - throwing cold water on a good time by sardonically remarking that he makes money by changing the dates on newspaper plates to print the old news as new news or mocking their talk of moonlight by noting that his telescope-maker has brought him moonshine before. (Not that printers, like Benjamin Franklin, were an occupation associated with low intelligence to begin with.)

OP's example is correct and you are wrong. 'Pigheaded' is neither a proposed root cause analysis nor does it mean 'are dumb'; perhaps you should check a dictionary before correcting others' usage. It means stubborn, strong-willed, obstinate, often to the point of foolishness or taking very harmful actions, or to quote the OED: "Having a head like that of a pig. Chiefly figurative: stupidly obstinate, perverse, or set in one's ways." Note: it is "stupidly obstinate", and not "stupid". This is because pigs are notoriously smart but stubborn: very strong, heavy, often hungry, whose mind can't easily be changed by an unfortunate swineherd or passerby in their way. (And this usage has been consistent since the start: the OED will give you the first attestation of it to Ben Jonson, where it describes a small-minded* printer who thinks that high-quality news has to be paid for, because that's how he operates; Jonson then mocks some other tradesmen for their own kinds of narrowmindedness, but not for any of them being low-IQ.) Hence, the Russell conjugation is correct: "pigheaded" is the highly insulting figurative term which intensifies the negative "obstinate" which is the bad version of the positive "firm". Just as 'firm' does not principally mean 'dumb', 'pigheaded' doesn't principally mean it either.

* note, by the way, that 'small-minded' doesn't mean, 'has a low cranial volume and thus lower than average intelligence', nor is it a root-cause analysis that their low intelligence is caused by inadequate neural tissue.

But the caveat there is that this is inherently a backwards-looking result:

We consider GPT-4o (OpenAI, 2024), Claude-3.5-Sonnet (Anthropic, 2024), Grok-2 (xAI, 2024), Gemini-1.5-Pro (Google, 2024), and DeepSeek-V3 (DeepSeek-AI, 2024).

So one way to put it would be that people & classifiers are good at detecting mid-2024-era chatbot prose. Unfortunately, sometime after that, at least OpenAI and Google apparently began to target the problem of ChatGPTese (possibly for different reasons: Altman's push into consumer companion-bots/personalization/social-networking, and Google just mostly ignoring RLHF in favor of capabilities), and the chatbot style seems to have improved substantially. Even the current GPT-4o doesn't sound nearly as 4o-like as it did just back in November 2024. Since mode-collapse/ChatGPTese stuff was never a capabilities problem per se (just look at GPT-3!), but mostly just neglect/apathy on the part of the foundation labs (as I've been pointing out since the beginning), it's not a surprise that it could improve rapidly once they put (possibly literally) any effort into fixing it.

Between the continued rapid increase in capabilities and paying some attention to esthetics & prose style and attackers slowly improving their infrastructure in the obvious ways, I expect over the course of 2025 that detecting prose from a SOTA model is going to get much more difficult. (And this excludes the cumulative effect on humans increasingly writing like ChatGPT.)

EDIT: today on HN, a post was on the front page for several hours with +70 upvotes, despite being blatantly new-4o-written (and impressively vapid). Is this the highest-upvoted LLM text on HN to date? I suspect that if it is, we'll soon see higher...
