Looking back at the parameters of the bet, it's interesting to me that the benchmark and math components have all fallen, but that the two "real world" components of the bet are still standing.
It also seems a patent was filed for this material in 2021 and was granted earlier this year prior to publication.
Notably, if you tell it to think step by step it gets the question right and answers with:
This problem is known as the Monty Hall problem. In the original problem, the car is placed randomly behind one of the three doors, and the host always opens a door with a goat after you make your initial choice. However, in your variation, the car is always behind door number 1.
Let's analyze your version step by step:
You initially choose door No. 1, which always has the car behind it.
The host, knowing what's behind the doors, will always open one of the other two doo...
I've found its ability to be much better as well. In contrast to GPT-3, which often seemed to be unable to keep track of board state and made illegal moves toward the end of the game, it not only played legal moves, it actually mated me. Granted, I'm a terrible player and I was deliberately not reading ahead to see if it would be able to mate a weak player. My method was to tell it I wanted to play and then give my move in algebraic notation. It would respond with a move, then I would respond with another. After it beat me, I asked it to list all the moves...
I recently got access to Bing and asked it about the bullet in temporary gravity of varying duration. It does quite a bit better than GPT-3, though it's very verbose. It does do a search during its answer, but only to find the typical initial velocity of a bullet. It makes an error regarding the final velocity of the bullet after three seconds, but correctly determines that the bullet will go up forever if gravity lasts three seconds but will fall back to Earth if it lasts five minutes. Bold is me, everything else is Bing.
Okay, I’ve cleared the slate ...
It would cause a severe heat dissipation problem. All that energy is going to be radiated as waste heat and, in equilibrium, will be radiated as fast as it comes in. The temperature required to radiate at the requisite power level would be in excess of the temperature at the surface of the sun. Any harvesting machinery on the surface of the planet would melt unless it is built from something unknown to modern chemistry.
In my particular case it wasn't really all that hard. I went to an extremely small school so classes weren't tracked the way they might be at a larger school. Since I was much better at taking tests than my peers I didn't really have to study to get A's on tests. We didn't even have all that much homework, though I guess it probably was hundreds of hours over the course of my high school career. I would have had to do that regardless though.
For me the answer is yes, but my situation is quite non-central. I got into MIT since I was a kid from a small rural town with really good grades, really good test scores, and was on a bunch of sports teams. Because I was from a small rural town and was pretty smart, none of this required special effort other than being on sports teams (note: being on the teams required no special skill, as everyone who tried out made the team given the small class size). The above was enough to get me an admission, probably for reasons of diversity. I'm a white man but I'm fairl...
But this doesn’t solve the problem of angry customers and media the way firing a misbehaving employee would. Though I suppose this is more an issue of friction/aversion to change than an actual capabilities issue.
Yeah, but Putin’s been president of Russia for over 20 years and already has a very large, loyal following. There will always be those that enthusiastically follow the party line of the leader. It’s somewhat harder to actually seize power. (None of this is to excuse the actions of Putin or those who support him.)
Likely higher than one in a million, but after a failure they can be fired, which allows the company to save face. Harder to do that with a $50M language model.
I think the issue here is that the tasks in question don't fully capture everything we care about in terms of language facility. I think this is largely because even very low probabilities of catastrophic actions can preclude deployment in an economically useful way.
For example, a prime use of a language model would be to replace customer service representatives. However, if there is even a one in a million chance that your model will start cursing out a customer, offer a customer a million dollars to remedy an error, or start spewing racial epithets, the model cannot be usefully deployed in such a fashion. None of the metrics in the paper can guarantee, or even suggest, that level of consistency.
One small quibble: you can actually live much more cheaply on rice. A pound of dry rice contains 1600 calories. If you eat 2000 calories a day, you need 5 pounds every 4 days, so a 50 pound bag will last 40 days, meaning you need about 9 bags per year. This has a total cost of about $450 at your price, and probably less if you shop around or buy in bulk.
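The arithmetic in this comment can be checked with a short sketch. The calorie and bag figures come from the comment itself; the $50 per bag price is not stated there and is only inferred from $450 for 9 bags, so treat it as an assumption. The function name is made up for illustration.

```python
# Figures taken from the comment above.
CALORIES_PER_POUND = 1600   # dry rice
DAILY_CALORIES = 2000
BAG_POUNDS = 50

# Assumption: inferred from "$450" for "9" bags; not stated directly.
PRICE_PER_BAG = 50.0


def rice_budget(cal_per_lb, daily_cal, bag_lb, price_per_bag, days=365):
    """Return (pounds per day, days per bag, bags per year, yearly cost)."""
    pounds_per_day = daily_cal / cal_per_lb
    days_per_bag = bag_lb / pounds_per_day
    bags_per_year = days / days_per_bag
    yearly_cost = bags_per_year * price_per_bag
    return pounds_per_day, days_per_bag, bags_per_year, yearly_cost


lb_per_day, days_per_bag, bags, cost = rice_budget(
    CALORIES_PER_POUND, DAILY_CALORIES, BAG_POUNDS, PRICE_PER_BAG
)
print(f"{lb_per_day} lb/day, {days_per_bag} days/bag, "
      f"{bags} bags/year, ${cost:.2f}/year")
```

A 365-day year works out to slightly more than 9 bags, so the $450 figure is a mild underestimate if bags can only be bought whole.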
I think you added an extra three zeros during your total year calculations. You list 2.23E15 as the total number of years experienced, but multiplying the total time of 5E4 by the current population of 8E9 gives a total of only 4E14 experience years. The true number must be quite a bit lower as the human population was quite a bit lower than 8 billion for most of that time. This also affects the proportion of experience years which have occurred in living memory. My guess is 20% have occurred since the birth of Kane Tanaka and 10% experienced by living peo...
You are super right and that is exactly what happened - I checked the numbers and had made the total three orders of magnitude too large. Thanks for the sanity checks and catch. It turns out this moves the midpoint up to 1432. Lemme fix the other numbers as well.
Update: Actually, it did nothing to the midpoint, which makes sense in retrospect (maybe?) but does change the "fraction of time" thing, as well as some of the Fermi estimates in the middle.
15% of experience has actually been experienced by living people, and 28% since Kane Tanaka's birth. I've updated this here and on my blog.
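The sanity check that started this thread can be written out as a minimal sketch. It uses only the three numbers quoted in the original comment (5E4 years, 8E9 people, and the listed total of 2.23E15) and relies on the comment's premise that the population was below today's level for most of that span. The function name is made up for illustration.

```python
# Numbers quoted in the comment that raised the issue.
SPAN_YEARS = 5e4            # total time considered
CURRENT_POPULATION = 8e9    # treated as a ceiling on population
CLAIMED_TOTAL = 2.23e15     # total experience years originally listed


def max_experience_years(span_years, peak_population):
    """Upper bound: everyone alive at peak population for the whole span."""
    return span_years * peak_population


upper_bound = max_experience_years(SPAN_YEARS, CURRENT_POPULATION)
exceeds_bound = CLAIMED_TOTAL > upper_bound

print(f"upper bound: {upper_bound:.2e} experience years")
print(f"claimed total exceeds the bound: {exceeds_bound}")
```

Any total above this bound has to contain an error somewhere, which is what made the original figure stand out.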
It seems that o4-mini-high (released today) is able to solve the first problem with one attempt, though it needs some prompting to explain its solution. It first asserts that the minimal number of moves is 15. If you ask it to list the moves, it is able to do so, and the list of moves seems valid on my check. If asked to prove that 15 is minimal, it reports that a BFS shows that 15 is minimal.
I'm not sure if this fully counts as a success, as I suspect it wrote code to perform the BFS while generating the answer. It was also unable to point out that,...