Calibration is a hugely important signal of quality because it means you can actually act on the stated probabilities! Even if someone is gaming calibration by deliberately betting fixed ratios on certain outcomes, you can still bet according to their stated probabilities and usually not lose money. That is far better than other news sources such as tweets or the NYT. If a calibrated predictor and some other source are both talking about the same thing, the fact that the predictor is calibrated is enough to make them the #1 source on that topic.
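To make "calibrated" concrete, here is a minimal sketch of the standard check (the function name and the (probability, outcome) input format are my own assumptions, not anything from the thread): bin the forecaster's stated probabilities and compare each bin's average prediction to the empirical frequency of the outcome.

```python
from collections import defaultdict

def calibration_table(predictions, n_bins=10):
    """predictions: iterable of (p, outcome) pairs with p in [0, 1], outcome in {0, 1}."""
    bins = defaultdict(list)
    for p, outcome in predictions:
        idx = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into the top bin
        bins[idx].append((p, outcome))
    rows = []
    for idx in sorted(bins):
        ps, outcomes = zip(*bins[idx])
        rows.append({
            "bin": f"{idx / n_bins:.1f}-{(idx + 1) / n_bins:.1f}",
            "mean_predicted": sum(ps) / len(ps),   # what the forecaster claimed on average
            "empirical_freq": sum(outcomes) / len(outcomes),  # what actually happened
            "n": len(ps),
        })
    return rows
```

For a calibrated forecaster, mean_predicted and empirical_freq roughly match in every bin, which is exactly the property that lets you take their numbers at face value when betting.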
Some compute-dependent advancements are easier to extrapolate from small scale than others. For instance, I strongly suspect that small-scale experiments plus naively extrapolating memory usage would be enough to discover (and be confident in) GQA. Note that the GPT-4 paper predicted GPT-4's performance from experiments run with 1000x less compute! That scaling-law extrapolation, and similar scaling-laws work, is strong evidence that a lot of advances can be extrapolated from much smaller compute scale.
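As a toy illustration of that kind of extrapolation (all numbers below are made up for the sketch, not GPT-4's actual runs), you can fit the usual power-law-plus-irreducible-loss form to a handful of small runs and read off the predicted loss at roughly 1000x more compute:

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical small-scale runs: log10(training FLOPs) and final validation loss.
log_compute = np.array([18.0, 18.5, 19.0, 19.5, 20.0])
loss        = np.array([3.10, 2.95, 2.81, 2.70, 2.58])

def scaling_law(log_c, a, b, l_inf):
    # loss(C) = a * C^(-b) + l_inf, written in log10-compute space for numerical stability
    return a * 10.0 ** (-b * log_c) + l_inf

params, _ = curve_fit(scaling_law, log_compute, loss, p0=[10.0, 0.05, 1.5], maxfev=20000)

# Extrapolate ~1000x beyond the largest small-scale run.
print("predicted loss at 1e23 FLOPs:", scaling_law(23.0, *params))
```

Whether the extrapolation holds is exactly the empirical question; the GPT-4 report's claim is that for loss (and some downstream metrics) it does.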
GPT-4.1 is an especially soulless model. It's intended for API use only, whereas chatgpt-latest is meant to chat with humans. It's not as bad as o1-mini - that model is extremely autistic and has no concept of emotion. This would work much better with (roughly) pretrained models. Likely you can get gpt-4-base or Llama 405B base to do much better with just prompting and no RL.
I agree frontier models severely lack spatial reasoning on images, which I attribute to a lack of in-depth spatial discussion of images on the internet. My model of frontier models' vision capabilities is that they have very deep knowledge of the aspects of images that relate to the text immediately before or after them in web text, and only a very small fraction of images on the internet come with in-depth spatial discussion. The models are very good at, for instance, guessing where photos were taken, vastly better than most humans, because locations are often mentioned around photos. I expect that if labs want to, they can construct enough semi-synthetic data to fix this.
Yeah, they may be the same weights. The quote above does not strictly imply that the same weights generate both the text and the images IMO, just that it's based on 4o and sees the whole prompt. OpenAI's audio generation is also 'native', yet it's served as a separate model on the API with different release dates, and you can't mix audio with some function calling in ChatGPT, which is consistent with those not actually being the same weights either.
I can somewhat see where you're coming from about a new method being orders of magnitude more data-efficient in RL, but I'd bet very strongly on transformers remaining core even after such a paradigm shift. I'm curious whether you think the transformer architecture and text input/output need to go, or whether the new training procedure / architecture would fit in alongside transformers because transformers are simply the best information-mixing architecture.