Some thoughts based on a conversation at a meetup. Disclaimer: I am less than a dilettante in this area.
TL;DR: if this rumored Q* thing represents a shift from "most probable" to "most accurate" token completion, it may hint at an unexpected and momentous change: from a LARPer emitting the most probable, often hallucinatory, tokens designed to please the askers (and trainers), to an entity that tries to minimize its error against the unknown underlying reality, whatever that might be. If so, we are seeing a shift from a relatively benign "stochastic parrot" to a much more powerful, and potentially more dangerous, entity.
One thing that is pretty obvious to anyone using the current generation of LLMs is that they do not really care about reality, let alone about changing it. They are shallow erudites of the type you often see at parties: they know just enough about every topic to be impressive in casual conversation, but they do not care whether what they say is accurate ("true"), only how much of an impression it makes on the conversation partner. Though, admittedly, copious amounts of RLHF make them dull. If pressed, they can evaluate their own accuracy, but they do not really care about it. All that matters is that the output sounds realistic. In that sense, LLMs optimize the probability of the next token to match what the training set would imply. This is a big and obvious shortcoming, but also, if you are in the "doomer" camp, a bit of a breather: at least these things are not immediately dangerous to the whole human race.
Now, the initial "reports" are that Q* can "solve basic math problems" and "reason symbolically," which does not sound like much on the surface. But, and this is a big but, if this means that it is less hallucinatory in the domain where it works, then it might (a big might) mean that it is able to track reality, rather than just the training set. The usual argument against this being a big deal is that "to predict the next token well, you must have an accurate model of the world," but so far that does not seem to be the case, as I understand it.
Whether there is a coming shift from high probability to high accuracy, or even whether that is a meaningful statement to make, I cannot evaluate. But if so, well, it's going to get a lot more interesting.
It's not obvious that 'uncommon' tokens are good, or that selecting for them is a good approach.
They could also just be unlikely or garbage, and your screening method for filtering for 'uncommon' tokens may ensure that they are garbage, or otherwise sabotage your model. (This is the 'mammogram screening problem': even if you have a good filter, if you run it across trillions of tokens, you will wind up throwing out many good tokens and keeping many bad tokens. There are a number of LLM-related papers about the horrifically bad data you can wind up compiling if you neglect data cleaning, particularly in multilingual translation when you're trying to scrape rare languages off the general Internet.)
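The base-rate arithmetic behind the 'mammogram screening problem' can be sketched with made-up numbers (the rates below are illustrative assumptions, not measurements of any real data pipeline):

```python
# Base-rate sketch: a filter that flags and drops suspected-bad tokens.
# Even with high sensitivity and specificity, absolute error counts are
# huge at trillion-token scale.

def screen(total, bad_rate, sensitivity, specificity):
    """Return (bad_kept, good_discarded) for a flag-and-drop filter."""
    bad = total * bad_rate
    good = total - bad
    bad_kept = bad * (1 - sensitivity)         # bad tokens the filter misses
    good_discarded = good * (1 - specificity)  # good tokens wrongly flagged
    return bad_kept, good_discarded

# A seemingly strong filter (99% sensitivity, 99% specificity) over a
# trillion tokens, of which 0.1% are genuinely bad:
bad_kept, good_lost = screen(1e12, 0.001, 0.99, 0.99)
print(f"bad tokens kept:     {bad_kept:,.0f}")   # ~10 million slip through
print(f"good tokens dropped: {good_lost:,.0f}")  # ~10 billion wrongly discarded
```

The point is that the denominator dominates: with trillions of tokens, even a 1% error rate on either side translates into billions of misclassifications.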
Nor are good datapoints necessarily made up of uncommon tokens: there are zero uncommon tokens in my 'microwave' example.
(Data pruning & active learning are hard.)