If GPT-4.5 was supposed to be GPT-5, why would Sam Altman underdeliver on compute for it? Surely GPT-5 would have been a top priority?
If it's not obvious at this point why, I would prefer not to go into it here in a shallow, superficial way, and instead refer you to the OA coup discussions.
GPT-4.5 is roughly a 10x scale-up of GPT-4, right? And full number jumps in GPT have always been ~100x? So GPT-4.5 seems like the natural name for OpenAI to go with.
10x is what it was, but it wasn't what it was supposed to be. That's just what they finally killed it at, after the innumerable bugs and other issues that they alluded to during the livestream and elsewhere, which is expected given the 'wait equation' for large DL runs - after a certain point, no matter how much you have invested, it's a sunk cost and you're better off starting afresh, such ...
at that time the median estimate for GPT-5 release was December 2024.
Which was correct ex ante, and mostly correct ex post - that's when OA had been dropping hints about releasing GPT-4.5, which was clearly supposed to have been GPT-5; they seemingly changed their mind near Dec 2024 and spiked it, before the DeepSeek moment in Jan 2025 apparently changed their minds back and they released it in February 2025. (And GPT-4.5 is indeed a lot better than GPT-4 across the board. Just not a reasoning model or dominant over the o1-series.)
which was clearly supposed to have been GPT-5
I have seen people say this many times, but I don't understand. What makes it so clear?
GPT-4.5 is roughly a 10x scale-up of GPT-4, right? And full number jumps in GPT have always been ~100x? So GPT-4.5 seems like the natural name for OpenAI to go with.
I do think it's clear that OpenAI viewed GPT-4.5 as something of a disappointment, I just haven't seen anything indicating that they at some point planned to break the naming convention in this way.
GPT was $20/month in 2023 and it's still $20/month.
Those are buying wildly different things. (They are not even comparable in terms of real dollars. That's like a 10% difference, solely from inflation!)
It’s not my view at all. I think a community will achieve much better outcomes if being bothered by the example message is considered normal and acceptable, and writing the example message is considered bad.
That's a strange position to hold on LW, where it has long been a core tenet that one should not be bothered by messages like that. And that has always been the case, whether it was LW2, LW1 (remember, say, 'babyeaters'? or 'decoupling'? or Methods of Rationality), Overcoming Bias (Hanson, 'politics is the mindkiller'), SL4 ('Crocker's Rules') etc.
I ...
I would also point out that, despite whatever she said in 1928 about her 1909 inheritance, Woolf committed suicide in 1941 after extensive mental health challenges which included "short periods in 1910, 1912, and 1913" in a kind of insane asylum, only afterwards beginning her serious writing career (which WP describes as heavily motivated by her psychiatric problems as a refuge/self-therapy); so one can certainly question her own narrative of the benefits of her UBI or the reasons she began writing. (I will further note that the psychological & psy...
Update: Bots are still beaten by human forecasting teams/superforecasters/centaurs on truly heldout Metaculus problems as of early 2025: https://www.metaculus.com/notebooks/38673/q1-ai-benchmarking-results/
A useful & readable discussion of various methodological problems (including the date-range search problems above) which render all forecasting backtesting dead on arrival (IMO) was recently compiled as "Pitfalls in Evaluating Language Model Forecasters", Paleka et al 2025, and is worth reading if you are at all interested in the topic.
Personality traits are an especially nasty danger because, given the existence of stabilizing selection + non-additive variance + high social homogamy/assortative mating + many personality traits with substantial heritability, you can probably create extreme self-sustaining non-coercive population structure with a package of edits. I should probably write some more about this, because I think that embryo selection doesn't create this danger (or in general result in the common fear of 'speciation'), but embryo editing/synthesis does.
Key lesson:
One conclusion we have drawn from this is that the most important factor for good forecasting is the base model, and additional prompting and infrastructure on top of this provide marginal gains.
Scaling remains undefeated.
It'd be a lot easier to check claims here if you included the original hyperlinks (or in the handful of cases that a URL is provided, made it clickable).
If you know what to search for, you can dig out that old post. Of course, leaving memorable breadcrumbs you can search for three years later is, at best, an art
Yes, that has been my experience too. Sure, Discord (like Twitter) gives you fairly powerful search primitives, to a greater extent than most people ever notice. You can filter by user, date-ranges, that sort of thing... It was written by nerds for nerds, originally, and it shows. However, I have still struggled to find many older Discord comments by myself or others, because it is inherent to th...
“buying vegetables they didn't need” doesn’t make any sense. Either nobody needs vegetables or everybody does; they’re healthy but not necessary to stay alive.
On Tuesday at Esmeralda in California, I watched a lot of people just like the protagonists at the farmer's market buying vegetables they didn't need. (I bought a sourdough loaf which I did need, and ate it.) At the house I'm staying at, I just got buzzed by the fly from the vegetables that the house renters bought which they didn't need. (Cherry tomatoes, if you were wondering.) It makes perfect sen...
What splits do you have in mind which are so much more often happening than mergers? We just saw Scale merge into FAIR, and not terribly long before that, Character.ai returned to the mothership, while Tesla AI de facto merged into xAI, and before that Adept merged into Amazon and Inflection into Microsoft etc, in addition to the de facto 'merges' which occur when an AI lab quietly drops out and concedes the frontier (eg Mistral) or pivots to opensource as a spoiler or commoditize-your-complement play. So I see plenty of merging, consistent with t...
I would also point out a perverse consequence of applying the rate limiter to old high-karma determined commenters: because it takes two to tango, a rate-limiter necessarily applies to the non-rate-limited person almost as much as the rate-limited person...
You know what's even more annoying than spending some time debating Said Achmiz? Spending time debating him when he pings you exactly once a day for the indefinite future as you both are forced to conduct it in slow-motion molasses. (I expect it is also quite annoying for anyone looking at that page, or the site comments in general.)
...And yet... lukeprog hasn't been seriously active on this site for 7 years, Wei Dai hasn't written a post in over a year (even as he engages in productive discussions here occasionally), Turntrout mostly spends his time away from LW, Quintin Pope spends all his time away from LW, Roko comments much less than he used to more than a decade ago, Eliezer and Scott write occasional comments once every 3 months or so, Richard Ngo has slowed down his pace of posting considerably, gwern posts here very infrequently (and when he does, it's usually just linking to o
Wait I don't think @gwern literally pastes this into the LLM? "Third parties like LLMs" sounds like "I'm writing for the training data".
That actually is the idea for the final version: it should be a complete, total guide to 'writing a gwernnet essay' written in a way comprehensible to LLMs, which they can read in a system prompt, a regular prompt, or retrieve from the Internet & inject into their inner-monologue etc. It should define all of the choices about how to markup stuff like unique syntax (eg. the LLMs keep flagging the interwiki links as s...
use a trick discovered by Janus to get Claude Opus 4 to act more like a base model and drop its “assistant” persona
Have you or Janus done anything more rigorous to check to what extent you are getting 'the base model', rather than 'the assistant persona pretending to be a base model'? This is something I've noticed with jailbreaks or other tweaks: you may think you've changed the bot persona, but it's really just playing along with you, and will not be as good as a true base model (even if it's at least stylistically superior to the regular non-roleplay...
These ancient F1 drivers sound like a good contrast to the NBA stats presented: if it was simply wealth/success, shouldn't there be a 'pyjama effect' there too? F1 drivers get paid pretty well too.
...Most players don’t survive very long. Over a third don’t make it past two years. The average person lasts five years. The odds of making it to the ten year mark is less than 25%. This curve is starkly contrasted to most modern jobs, with the median teacher career for example lasting over 25 years.
...Once the wealth and fame pile up, the grind feels optional. Yo
I wonder why no-one has just directly tried to do Turing debate, where the debaters submit ~2000 words explaining their views to each other beforehand, and then the actual debate is them taking on the position of the other side and trying to debate that.
One idea might be to pair debates with Delphi panels: do the usual Delphi method to get a consensus report beforehand, and then have them explain & debate what is left over as non-consensus (or possibly, if there are some experts who disagree hotly with the consensus report, bring them on for a debate with the original panel).
First, I didn't say it wasn't communicating anything. But since you bring it up: it communicated exactly what jefftk said in the post already describing the scene. And what it did communicate that he didn't say cannot be trusted at all. As jefftk notes, 4o in doing style transfer makes many large, heavily biased, changes to the scene, going beyond even just mere artifacts like fingers. If you don't believe that people in that room had 3 arms or that the room looked totally different (I will safely assume that the room was not, in fact, lit up in tastefully...
Seems similar to the "anti-examples" prompting trick I've been trying: taking the edits elicited from a chatbot, and reversing them to serve as few-shot anti-examples of what not to do. (This would tend to pick up X-isms.)
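A minimal sketch of how this could be set up (the edit pairs, function name, and prompt wording below are hypothetical illustrations, not my actual prompt): collect (original, chatbot-rewrite) pairs, flip them, and present the chatbot's version as the negative example in a few-shot prompt.

```python
# Hypothetical sketch of the "anti-examples" trick: edits a chatbot previously made to my
# prose are reversed, so its rewrites become few-shot examples of what NOT to do.
edits = [
    # (my original phrasing, the chatbot's rewrite of it) -- made-up examples
    ("It was a bad idea.", "It was, in many ways, a profoundly suboptimal idea."),
    ("He left.", "He departed, never to return."),
]

def build_antiexample_prompt(edits, draft):
    shots = "\n\n".join(
        f"BAD (do not write like this): {rewritten}\nGOOD (write like this): {original}"
        for original, rewritten in edits
    )
    return (
        "Edit the draft below. The BAD/GOOD pairs show chatbot-isms to avoid "
        "and the plainer style to keep.\n\n"
        f"{shots}\n\nDraft:\n{draft}"
    )

print(build_antiexample_prompt(edits, "An example draft paragraph."))
```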
One obvious reason to get upset is how low the standards of people posting them are. Let's take jefftk's post. It takes less than 5 seconds to spot how lazy, sloppy, and bad the hands and arms are, and how the picture is incoherent and uninformative. (Look at the fiddler's arms, or the woman going under 2 arms that make zero sense, or the weird doors, or the table which seems to be somehow floating, or the dubious overall composition - where are the yellow fairy and non-fairy going, exactly?, or the fact that the image is the stereotypical cat-urine yellow of all 4o images.) Why should you not feel disrespected and insulted that he was so careless and lazy to put in such a lousy, generic image?
I think that's exactly how it goes, yeah. Just free association: what token arbitrarily comes to mind? Like if you stare at some static noise, you will see some sort of lumpiness or pattern, which won't be the same as what someone else sees. There's no explaining that at the conscious level. It's closer to a hash function than any kind of 'thinking'. You don't ask what SHA is 'thinking' when you put in some text and it spits out some random numbers & letters. (You would see the same thing if you did a MLP or CNN on MNIST, say. The randomly initialized ...
...It is not clear how the models are able to self-coordinate. It seems likely that they are simply giving what they believe would be the most common answer the same way a group of humans might. However, it is possible the models are engaging in more sophisticated introspection focussing on how they specifically would answer. Follow-up investigations could capture models’ chain of thought as well as tweak the prompt to indicate that the model should strive to be consistent with an answer a human might give or another company’s AI model might give. Circuit-tr
Yes, a NN can definitely do something like know if it recognizes a datapoint, but it has no access to the backwards step per se. Like take my crashing example: how, while thinking in the forward pass, can it 'know' there will be a backward pass when there might be no backward pass (eg because there was a hardware fault)? The forward pass would appear to be identical in every way between the forward pass that happens when there is a backward pass, and when the backward pass doesn't happen because it crashed. At best, it seems like a NN cannot do more than s...
...You can look this up on knowyourmeme and confirm it, and I've done an interview on the topic as well. Now I don't know much about "improving public discourse" but I have a long string of related celebrity hoaxes and other such nonsense which often crosses over into a "War of the Worlds" effect in which it is taken quite seriously...I have had some people tell me that I'm doing what you're calling "degrading the public discourse," but that couldn't be farther from the truth. It's literature of a very particular kind, in fact. Are these stories misinterpret
There are some use-cases where quick and precise inference is vital: for example, many agentic tasks (like playing most MOBAs or solving a physical Rubik's cube; debatably most non-trivial physical tasks) require quick, effective, and multi-step reasoning.
Yeah, diffusion LLMs could be important not for being better at predicting what action to take, but for hitting real-time latency constraints, because they intrinsically amortize their computation more cleanly over steps. This is part of why people were exploring diffusion models in RL: a regular bidir...
This post is an example of my method. Over the last 1-2 years, I’ve made heavy use of AIs, lately DeepSeek and Claude. I do the same with them: present my ideas, deal with their criticisms and objections—whether to correct them or take correction myself—until we’re agreed or the AI starts looping or hallucinating. So, when I say I have yet to hear, after all this time, credible, convincing arguments to the contrary, it’s after having spent the time and done the work that most people don’t even attempt.
Or, to put it less flatteringly, "I harangue the mos...
I think there are many ways that a LLM could have situated awareness about what phase it is in, but I'm not sure if the gradient descent itself is a possibility?
While a NN is running the forward pass without any backprop, it is computing exactly the same thing (usually) that it would be computing if it was running a forward pass before a backwards pass to do a backprop. Otherwise, the backprop can't really work - if it doesn't see the 'real' forward pass, how does it 'know' how to adjust the model parameters to make the model compute a better forward pass ...
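A minimal PyTorch sketch of that point (toy model and shapes chosen arbitrarily): the forward computation is numerically identical whether or not a backward pass follows, so the forward pass by itself carries no signal about which phase the network is in.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
x = torch.randn(4, 8)

with torch.no_grad():          # pure inference: no graph is built, no backward pass is possible
    y_inference = model(x)

y_train = model(x)             # the same forward pass, but autograd records a graph
y_train.sum().backward()       # the backward pass only happens here, after the forward is done

print(torch.equal(y_inference, y_train.detach()))  # True: the activations are identical
```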
I agree it is poorly written, but I don't think it is, strictly speaking, 'LLM slop'. Or if it is, it's not an LLM I am familiar with, or is an unusual usage pattern in some way... It's just not written with the usual stylistic tics of ChatGPT (4o or o3), Claude-3/4, Gemini-2.5, or DeepSeek-r1.
For example, he uses a space after EM DASH but not before; no LLM does that (they either use no space or both before-after); he also uses '1) ' number formatting, where LLMs invariably use '1. ' or '#. ' proper Markdown (and generally won't add in stylistic redundanc...
It also sounds like a piece of paper, or a map, or a person having vivid hallucinations before falling asleep. But unless you have a whiteboard which can be copied among several hundred people and teleport and be rolled up and fit in a jean pocket, which lets you timetravel so you can look at what used to be on the whiteboard or look at what people might write on it in the future, or 'a whiteboard' which is neither white (because there's a colored map printed on it) nor 'a board' (because it's arbitrarily many), which has a ledgerbook next to itself which writes itself, and so on, I would suggest that this does not 'sound like a whiteboard' to most people. (No, not even a Biblically-accurate whiteboard.)
Yes, there's a lot of computer-related ones depending on how finegrained you get. (There's a similar issue with my "Ordinary Life Improvements": depending on how you do it, you could come up with a bazillion tiny computer-related 'improvements' which sort of just degenerates into 'enumerating every thing ever involving a transistor in any way' and is not enlightening the same way that, say, 'no indoors smoking' or 'fresh mango' is.) So I would just lump that one under 'Machine Configuration/Administration § Software' as one of the too-obvious-to-be-worth-mentioning hacks.
How did you check Claude's claims here?
Idea: "Conferences as D&D tabletops": you may be able to better organize a conference or convention by borrowing a tool from tabletop roleplaying games - players collaborate by directly manipulating or modifying a 2D map. It seems to me like this could be low-friction and flexibly handles a lot of things that existing 'conware' design patterns don't handle well.
I have not done any work directly on it. The LLMs have kept improving so rapidly since then, especially at coding, that it has not seemed like a good idea to work on it.
Instead, I've been thinking more about how to use LLMs for creative writing or personalization (cf. my Dwarkesh Patel interview, "You should write more online"). To review the past year or two of my writings:
So for example, my meta-learning LLM interviewing proposal is about how to teach a LLM to ask you useful questions about your psychology so it can better understand & personalize
I was trying out a hierarchical approach when I stopped, because I wasn't sure if I could trust a LLM to rewrite a whole input without dropping any characters or doing unintended rewrites. Aside from being theoretically more scalable, and potentially better by making each step easier and propagating the sorting top-down, if you explicitly turn it into a tree, you can easily check that you get back an exact permutation of the list each time, and so that the rewrite was safe. I think that might be unnecessary at this point, given the steady improvement in prompt adherence, so maybe the task is now trivial.
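For concreteness, the permutation check is trivial to implement; a sketch (item names hypothetical), where the LLM's reordering is rejected unless it is an exact permutation of the input:

```python
from collections import Counter

def is_exact_permutation(original: list[str], reordered: list[str]) -> bool:
    # Multiset equality: nothing dropped, duplicated, or silently rewritten.
    return Counter(original) == Counter(reordered)

items = ["apple pie recipe", "banana bread notes", "cherry tart sketch"]
llm_output = ["cherry tart sketch", "apple pie recipe", "banana bread notes"]  # whatever the LLM returned
assert is_exact_permutation(items, llm_output), "reject the rewrite and retry"
```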
There's no explicit distances calculated: just asking the LLM to sort the list meaningfully.
Very funny, but the OA embeddings were always bad at sentence embedding, specifically, compared to other NN sentence-specialized embeddings; and as the original OA embedding paper somewhat defensively argues, it's not even clear a priori what a sentence embedding should do because a sentence is such a cut-down piece of text, and doing well at a sentence embedding task may only be overfitting or come at the cost of performance on more meaningful text embedding tasks. (Similar to a word embedding: they are so poly-semantic or context-dependent that it seems ...
Yeah, it's limited by what kind of structure you have. It did seriate your list successfully, it sounds like; it's just that you have a lot of structure in the list that you don't care about, so no embedding is going to prioritize the other stuff, and the distances aren't useful to you in general. This will hurt any embedding-related use-case, not just seriation - presumably your k-NN lookups aren't terribly useful either, and they mostly just pull up hits which have superficial syntactic similarities.
This is probably less of a problem with my annotations becaus...
As I've said before, I think you greatly overrate the difficulty of putting search into neural nets, and this is an example of it. It seems to me like it is entirely possible to make a generic LLM implement an equivalent to AlphaZero and be capable of expert iteration, without an elaborate tree scaffolding. A tree search is just another algorithm which can be reified as a sequence, like all algorithms (because they are implemented on a computer).
All AlphaZero is, is a way of doing policy iteration/Newton updates by running a game state forward for a few pl...
My earlier commentary on what I think note-taking tools tend to get wrong: https://gwern.net/blog/2024/tools-for-thought-failure
Here is another way to defend yourself against bot problems:
Turned out to be fake, BTW. His friend just pranked him.
for text, you might realize that different parts of the text refer to each other, so you need a way to effectively pass information around, and hence you end up with something like the attention mechanism
If you are trying to convince yourself that a Transformer could work and to make it 'obvious' to yourself that you can model sequences usefully that way, it might be a better starting point to begin with Bengio's simple 2003 LM and MLP-Mixer. Then Transformers may just look like a fancier MLP which happens to implement a complicated way of doing token-mixin...
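To make the contrast concrete, here is a minimal token-mixing block in the MLP-Mixer style (a sketch; layer sizes are arbitrary): information moves between sequence positions through an ordinary MLP applied across the token dimension, rather than through attention.

```python
import torch
import torch.nn as nn

class TokenMixingBlock(nn.Module):
    def __init__(self, seq_len: int, d_model: int, hidden: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        # This MLP acts across token positions (the last dim after the transpose), not across channels.
        self.mix = nn.Sequential(nn.Linear(seq_len, hidden), nn.GELU(), nn.Linear(hidden, seq_len))

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        y = self.norm(x).transpose(1, 2)       # -> (batch, d_model, seq_len)
        y = self.mix(y).transpose(1, 2)        # mix information across positions, transpose back
        return x + y                           # residual connection

x = torch.randn(2, 16, 32)                     # batch of 2, 16 tokens, 32 channels
print(TokenMixingBlock(seq_len=16, d_model=32, hidden=64)(x).shape)  # torch.Size([2, 16, 32])
```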
Or just clipped out. It takes 2 seconds to clip it out and you're done. Or you just fast forward, assuming you saw the intro at all and didn't simply skip the first few minutes. Especially as 'incest' becomes universal and viewers just roll their eyes and ignore it. This is something that is not true of all fetishes: there is generally no way to take furry porn, for example, and strategically clip out a few pixels or frames and make it non-furry. You can't easily take a video of an Asian porn star and make them white or black. And so on and so forth.
But if a metric is trivially gameable, surely that makes it sus and less impressive, even if someone is not trivially, or even at all gaming it.
Why would you think that? Surely the reason that a metric being gameable matters is if... someone is or might be gaming it?
Plenty of metrics are gameable in theory, but are still important and valid given that you usually can tell if they are being gamed. Apply this to any of the countless measurements you take for granted. Someone comes to you and says 'by dint of diet, hard work (and a bit of semaglutide), my bathroom scal...
Good calibration is impressive and an interesting property because many prediction sources manage to not clear even that minimal bar (almost every human who has not undergone extensive calibration training, for example, regardless of how much domain expertise they have).
Further, you say one shouldn't be impressed by those sources because they could be flipping a coin, but then you refuse to give any examples of 'impressive' sources which are doing just the coin-flip thing or an iota of evidence for this bold claim, or to say what they are unimpressive compared to.
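For concreteness, the 'minimal bar' of calibration is just checking that stated probabilities match observed frequencies; a toy sketch with made-up forecasts:

```python
from collections import defaultdict

# (stated probability, did the event happen?) -- invented data for illustration
forecasts = [(0.9, True), (0.8, True), (0.7, False), (0.3, False), (0.2, False), (0.1, True)]

bins = defaultdict(list)
for p, outcome in forecasts:
    bins[round(p, 1)].append(outcome)          # crude bins at 10% resolution

for p in sorted(bins):
    outcomes = bins[p]
    print(f"stated {p:.0%}  observed {sum(outcomes) / len(outcomes):.0%}  (n={len(outcomes)})")
```

A well-calibrated source has the two columns roughly agreeing across bins; an overconfident one shows stated probabilities well above the observed frequencies.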
I think I would have predicted that Tesla self-driving would be the slowest
For graphs like these, it obviously isn't important how the worst or mediocre competitors are doing, but the best one. It doesn't matter who's #5. Tesla self-driving is a longstanding, notorious failure. (And apparently is continuing to be a failure, as they continue to walk back the much-touted Cybertaxi launch, which keeps shrinking like a snowman in hell, now down to a few invited users in a heavily-mapped area with teleop.)
I'd be much more interested in Waymo numbers, as that...
The trends reflect the increasingly intense tastes of the highest spending, most engaged consumers.
https://logicmag.io/play/my-stepdad's-huge-data-set/
...While a lot of people (most likely you and everyone you know) are consumers of internet porn (i.e., they watch it but don’t pay for it), a tiny fraction of those people are customers. Customers pay for porn, typically by clicking an ad on a tube site, going to a specific content site (often owned by MindGeek), and entering their credit card information.
This “consumer” vs. “customer” division is key to
This theory feels insufficient to me, or like it's missing a step. It makes sense to me for people to pay when their preferred porn is undersupplied, but incest porn is now abundant. You need a more specific reason incest fans will pay even when they don't have to.
Additionally, "but you're my stepdad" isn't equivalent to a couple of foot shots. Lots of people are (or at least were) turned off by incest.
I think aside from the general implausibility of the effect sizes and the claimed AI tech (GANs?) delivering those effect sizes across so many areas of materials, one of the odder claims which people highlighted at the time was that supposedly the best users got a lot more productivity enhancement than the worst ones. This is pretty unusual: usually low performers get a lot more out of AI assistance, for obvious reasons. And this lines up with what I see anecdotally for LLMs: until very recently, possibly, they were just a lot more useful for people not very good at writing or other stuff, than for people like me who are.