There's still my original question of where the feedback comes from. You say keep the transcripts where the final answer is correct, but how do you know the final answer? And how do you come up with the question?
What seems to be going on is that these models are actually quite supervised, despite everyone's insistence on calling them unsupervised RL. The questions and answers appear to be high-quality human annotation instead of being machine generated. Let me know if I'm wrong about this.
If I'm right, it has implications for scaling. You need human annotators to scale, and you need to annotate increasingly hard problems. You don't get to RL your way to infinite skill like AlphaZero; if, say, the Riemann hypothesis turns out to be 3 OOMs of difficulty beyond what humans can currently annotate, then this type of training will never solve it, no matter how you scale.
I have no opinion about whether formalizing proofs will be a hard problem in 2025, but I think you're underestimating the difficulty of the task ("math proofs are math proofs" is very much a false statement for today's LLMs, for example).
In any event, my issue is that formalizing proofs is very clearly not involved in the o1/o3 pipeline, since those models make so many formally incorrect arguments. The people behind FrontierMath have said that o3 solved many of the problems using heuristic algorithms with wrong reasoning behind them; that's not something a model trained on formally verified proofs would do. I see the same thing with o1, which was evaluated on the Putnam and got the right answer with a wrong proof on nearly every question.
Well, the final answer is easy to evaluate. And like in rStar-Math, you can have a reward model that checks if each step is likely to be critical to a correct answer, then it assigns an implied value to the step.
Why is the final answer easy to evaluate? Let's say we generate the problem "number of distinct solutions to x^3+y^3+xyz=0 modulo 17^17" or something. How do you know what the right answer is?
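(Just to make the verification burden concrete: brute force is easy for a tiny modulus but hopeless at 17^17. The sketch below is purely illustrative; the function name and the small modulus are mine.)

```python
# Brute-force count of solutions to x^3 + y^3 + x*y*z == 0 (mod m).
# Feasible for m = 17; for m = 17**17 there are roughly 6e62 triples to check,
# so naive enumeration can't tell you the right answer.
def count_solutions(m):
    return sum(
        1
        for x in range(m) for y in range(m) for z in range(m)
        if (x**3 + y**3 + x*y*z) % m == 0
    )

print(count_solutions(17))
```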
I agree that you can do this in a supervised way (a human puts in the right answer). Is that what you mean?
What about if the task is "prove that every integer can be written as the sum of at most 1000 different 11th powers"? You can check such a proof in Lean, but how do you check it in English?
And like in rStar-Math, you can have a reward model that checks if each step is likely to be critical to a correct answer, then it assigns an implied value to the step.
My question is where the external feedback comes from. "Likely to be critical to a correct answer" according to whom? A model? Because then you don't get the recursive self-improvement past what that model knows. You need an external source of feedback somewhere in the training loop.
Do you have a sense of where the feedback comes from? For chess or Go, at the end of the day, a game is won or lost. I don't see how to do this elsewhere except for limited domains like simple programming which can quickly be run to test, or formal math proofs, or essentially tasks in NP (by which I mean that a correct solution can be efficiently verified).
For other tasks, like summarizing a book or even giving an English-language math proof, it is not clear how to detect correctness, and hence not clear how to ensure that a model like o5 doesn't give a worse output after thinking/searching a long time than the output it would give in its first guess. When doing RL, it is usually very important to have non-gameable reward mechanisms, and I don't see that in this paradigm.
I don't even understand how they got from o1 to o3. Maybe a lot of supervised data, i.e., OpenAI internally created some FrontierMath-style problems to train on? Would that be enough? Do you have any thoughts about this?
The value extractable is rent on both the land and the improvement. LVT taxes only the former. E.g. if land can earn $10k/month after an improvement of $1mm, and if interest is 4.5%, and if that improvement is optimal, a 100% LVT is not $10k/mo but $10k/mo minus $1mm*0.045/12=$3,750. So 100% LVT would be merely $6,250.
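For concreteness, here is that arithmetic as a tiny sketch (the function is just illustrative, not a general LVT model):

```python
# The 100% LVT is the rent attributable to the land alone: total monthly rent minus
# the normal monthly return on the capital tied up in the improvement.
def monthly_lvt(total_rent, improvement_cost, annual_interest_rate):
    improvement_return = improvement_cost * annual_interest_rate / 12
    return total_rent - improvement_return

print(monthly_lvt(10_000, 1_000_000, 0.045))  # 6250.0
```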
If your improvement can't extract $6.3k from the land, preventing you from investing in that improvement is a feature, not a bug.
LVT applies to all land, but not to the improvements on the land.
We do not care about disincentivizing an investment in land (by which I mean, just buying land). We do care about disincentivizing investments in improvements on the land (by which I include buying the improvement on the land, as well as building such improvements). A signal of LVT intent will not have negative consequences unless it is interpreted as a signal of broader confiscation.
More accurately, it applies to signalling an intent to confiscate other investments; we don't actually care if people panic about land being confiscated, because buying land (rather than improving it) isn't productive in any way. (We may also want to partially redistribute resources towards the losers of the land confiscation to compensate for the lost investment -- that is, we may want the government to buy the land rather than confiscate it, though it would be bought at lower than market prices.)
It is weird to claim that the perceived consequence of planned incrementalism is "near-future governments want the money now, and will accelerate it". The actual problem is almost certainly the opposite: near-future governments will want to cut taxes, since cutting taxes is incredibly popular, and will therefore stop or reverse the planned incremental LVT.
Thanks for this post. A few comments:
Calculating these probabilities is fairly straightforward if you know some theory of generating functions. Here's how it works.
Let $x$ be a variable representing the probability of a single 6, and let $y$ represent the probability of "even but not 6". A single string consisting of even numbers can be written like, say, $xyxxy$, and this expression (which simplifies to $x^3y^2$) is the same as the probability of the string. Now let's find the generating function for all strings you can get in (A). These strings are generated by the following unambiguous regular expression:
$$y^*(xyy^*)^*xx$$
The magical property of generating functions is that if you have an unambiguous regular expression, the corresponding generating function is easy to calculate: concatenation becomes product, union becomes sum, and star becomes the function $z \mapsto 1/(1-z)$. Using this, the generating function for the strings in (A) is
$$f_A(x,y)=\frac{x^2}{1-y-xy}.$$
Similarly, the strings possible in (B) have unambiguous regular expression $y^*xy^*x$ and generating function $f_B(x,y)=\frac{x^2}{(1-y)^2}$.
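If you want to sanity-check the regular-expression-to-generating-function step, here is a small sketch (assuming Python with sympy installed; the cutoff N is arbitrary) that enumerates all strings matching the regex for (A) up to a fixed length and compares their total weight with the power series of $f_A$:

```python
# Sketch: verify f_A against brute-force enumeration of the regex y*(xyy*)*xx,
# comparing all strings of length <= N with the degree-<=N part of the series.
import re
from itertools import product
import sympy as sp

x, y, t = sp.symbols('x y t')
f_A = x**2 / (1 - y - x*y)
pattern = re.compile(r'y*(xyy*)*xx')

N = 8  # compare strings of length up to N

# Sum the monomial x^(#x) * y^(#y) over every matching string.
enumerated = sum(
    x**s.count('x') * y**s.count('y')
    for n in range(1, N + 1)
    for s in (''.join(p) for p in product('xy', repeat=n))
    if pattern.fullmatch(s)
)

# Expand f_A(tx, ty) as a power series in t (t tracks total degree), truncate, set t = 1.
truncated = sp.series(f_A.subs({x: t*x, y: t*y}), t, 0, N + 1).removeO().subs(t, 1)

print(sp.expand(enumerated - truncated))  # prints 0 if f_A matches the enumeration
```

The same check works for (B) with the pattern y*xy*x and $f_B$.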
If you plug in the probabilities $x=1/6$, $y=1/3$, the above functions will give you the probability of a string in (A) occurring and of a string in (B) occurring, respectively. But that's not what we want; we want conditional expectations. To get that, we need the probability of each string to be weighted by its length (and then to divide by the overall probability). The length of a string is the number of $x$ and $y$ variables in it -- its degree. So we can get the sum of lengths-times-probabilities by scaling $x$ and $y$ by $t$, taking a derivative with respect to $t$, and plugging in $t=1$. Then we divide by the overall probabilities. So the conditional expectations are
$$\hat{f}_A(x,y)=\frac{\left.\frac{d}{dt}f_A(tx,ty)\right|_{t=1}}{f_A(x,y)}=\frac{2-y}{1-y-xy},$$
$$\hat{f}_B(x,y)=\frac{\left.\frac{d}{dt}f_B(tx,ty)\right|_{t=1}}{f_B(x,y)}=\frac{2}{1-y}.$$
Now just plug in $x=1/6$, $y=1/3$ to get the conditional expectations.
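And for completeness, a minimal sympy sketch of these last two steps (only reproducing the formulas above; the helper name is mine):

```python
# Sketch: apply the t-derivative trick to f_A and f_B, then evaluate at x = 1/6, y = 1/3.
import sympy as sp

x, y, t = sp.symbols('x y t')
f_A = x**2 / (1 - y - x*y)
f_B = x**2 / (1 - y)**2

def conditional_expectation(f):
    # Length-weighted total d/dt f(tx, ty) at t = 1, divided by the total probability f(x, y).
    weighted = sp.diff(f.subs({x: t*x, y: t*y}), t).subs(t, 1)
    return sp.simplify(weighted / f)

vals = {x: sp.Rational(1, 6), y: sp.Rational(1, 3)}
for name, f in [('A', f_A), ('B', f_B)]:
    e = conditional_expectation(f)
    print(name, e, e.subs(vals))  # expect 30/11 for (A) and 3 for (B)
```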