Some thoughts based on a conversation at a meetup. Disclaimer: I am less than a dilettante in this area.
TL;DR: if this rumored Q* thing represents a shift from "most probable" to "most accurate" token completion, it might hint at an unexpected and momentous change: from a LARPer emitting the most probable, often hallucinatory, tokens designed to please the askers (and trainers), to an entity that tries to minimize its error against the unknown underlying reality, whatever it might be. If so, we are seeing a shift from a relatively benign "stochastic parrot" to a much more powerful, and potentially more dangerous, entity.
One thing that is pretty obvious to anyone using the current generation of LLMs is that they do not really care about reality, let alone about changing it. They are shallow erudites of the kind you often meet at parties: they know just enough about every topic to be impressive in casual conversation, but they do not care whether what they say is accurate ("true"), only how much of an impression it makes on the conversation partner. (Though, admittedly, copious amounts of RLHF make them dull.) If pressed, they can evaluate their own accuracy, but they do not really care about it; all that matters is that the output sounds plausible. In that sense, LLMs optimize the probability of the next token to match what the training set implies. This is a big and obvious shortcoming, but also, if you are in the "doomer" camp, a bit of a breather: at least these things are not immediately dangerous to the whole human race.
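To make that contrast concrete, here is a toy sketch in plain Python (numpy only; every name here is invented for illustration, not any real model's API) of the objective LLMs are actually trained on, next to the kind of hypothetical "accuracy" objective the rest of this post speculates about:

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def next_token_loss(logits, observed_token_id):
    """Cross-entropy against whatever token the training text actually
    contains -- 'most probable' completion, by construction. Nothing in
    this objective references any external ground truth."""
    probs = softmax(logits)
    return -np.log(probs[observed_token_id])

def accuracy_loss(completion, verifier):
    """Purely hypothetical alternative: score the completion against
    some external check of the underlying reality. `verifier` is a
    stand-in for any such ground-truth oracle."""
    return 0.0 if verifier(completion) else 1.0

logits = np.array([2.0, 0.5, -1.0])           # toy 3-token vocabulary
print(next_token_loss(logits, observed_token_id=0))
# Toy verifier: does the stated arithmetic actually hold?
print(accuracy_loss("2+2=5", verifier=lambda s: eval(s.replace("=", "=="))))
```

The first loss goes down when the output matches the training distribution; only the second goes down when the output matches reality.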
Now, the initial "reports" are that Q* can "solve basic math problems" and "reason symbolically," which does not sound like much on the surface. But, and this is a big but, if this means that it is less hallucinatory in the domain where it works, then it might (a big might) mean that it is able to track reality, rather than just the training set. The usual argument against this being a big deal is that "to predict the next token well, you must have an accurate model of the world," but so far that does not seem to have materialized, as I understand it.
Whether there is a coming shift from high probability to high accuracy, or even whether that is a meaningful statement to make, I cannot evaluate. But if so, well, it's going to get a lot more interesting.
Grade-school math, where problems have a single well-defined answer, seems like an environment in which a Q-learning-like approach to figuring out whether a step is valuable, based on whether it helps lead you to the right answer, might be pretty feasible (the biggest confounder would be cases where you manage to make two mistakes that cancel out and still get to the right answer). Given something like that, a path-finding algorithm along the lines of A* for finding the shortest route to the correct answer would then become feasible; a rough sketch of both pieces follows below. The net result would be a system that, at a large inference-time cost, could ace grade-school math problems, and by doing so might well produce really valuable training data for training a less inference-expensive system on.
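Here is that sketch on a deliberately tiny problem (reach 20 starting from 1, using only the steps +3 and *2). Everything is invented for illustration: the "step proposer" is random rather than an LLM, and the Q-learning piece is stood in for by plain Monte-Carlo averaging of rollout rewards; the A*-style best-first search then uses those noisy values as its heuristic.

```python
import heapq
import random
from collections import defaultdict

START, TARGET, MAX_DEPTH = 1, 20, 8                 # toy "grade-school" problem
STEPS = {'+3': lambda x: x + 3, '*2': lambda x: x * 2}

def episode():
    """One random rollout; reward 1 only for landing exactly on TARGET."""
    state, visited = START, []
    for _ in range(MAX_DEPTH):
        if state == TARGET:
            return visited, 1.0
        name = random.choice(list(STEPS))
        visited.append((state, name))
        state = STEPS[name](state)
    return visited, 1.0 if state == TARGET else 0.0

def estimate_q(n_rollouts=5000):
    """Monte-Carlo stand-in for Q-learning: Q(s, a) ~ average reward of
    rollouts that passed through (s, a). Note the confounder from the
    text: a rollout whose mistakes cancel out still collects the reward,
    so these are noisy estimates of 'was this step useful'."""
    totals, counts = defaultdict(float), defaultdict(int)
    for _ in range(n_rollouts):
        visited, reward = episode()
        for sa in visited:
            totals[sa] += reward
            counts[sa] += 1
    return {sa: totals[sa] / counts[sa] for sa in totals}

def search(q):
    """A*-style best-first search: g = steps taken so far, h = a crude
    'distance to the right answer' read off the learned step values."""
    def h(state):
        best = max(q.get((state, a), 0.0) for a in STEPS)
        return 0.0 if state == TARGET else (1.0 - best) * MAX_DEPTH
    frontier = [(h(START), 0, START, [])]
    seen = set()
    while frontier:
        _, g, state, path = heapq.heappop(frontier)
        if state == TARGET:
            return path
        if state in seen or state > TARGET:   # both steps only increase x
            continue
        seen.add(state)
        for name, f in STEPS.items():
            nxt = f(state)
            heapq.heappush(frontier, (g + 1 + h(nxt), g + 1, nxt, path + [name]))
    return None

print(search(estimate_q()))   # e.g. ['*2', '+3', '*2', '*2']
```

The step paths such a search finds, at nontrivial inference-time cost, are exactly the kind of verified traces that could then serve as training data for a cheaper system.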