LESSWRONG
All of jeff8765's Comments + Replies

Surprising LLM reasoning failures make me think we still need qualitative breakthroughs for AGI

It seems that o4-mini-high (released today) is able to solve the first problem in one attempt, though it needs some prompting to explain its solution. It first asserts that the minimal number of moves is 15. If you ask it to list the moves, it is able to do so, and the list looks valid when I check it. If asked to prove that 15 is minimal, it reports that a BFS shows that 15 is minimal.

I'm not sure if this fully counts as a success, as I suspect it wrote code to perform the BFS while generating the answer. It was also unable to point out that,... (read more)
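For readers unfamiliar with why a BFS can certify minimality: breadth-first search visits states in order of move count, so the first time it reaches a goal, no shorter sequence exists. The sketch below is only an illustration on a toy puzzle (reach 10 from 1 using "+1" and "×2"); the puzzle, the function name `min_moves`, and its parameters are assumptions for the example, not the problem discussed in the post or the code the model may have run.

```python
from collections import deque

def min_moves(start, is_goal, neighbors):
    """Return a shortest list of states from start to a goal, or None.

    BFS expands states in order of distance from start, so the first
    goal state dequeued is reached by a minimal number of moves.
    """
    parents = {start: None}          # also serves as the visited set
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if is_goal(state):
            path = []
            while state is not None:  # walk parent links back to start
                path.append(state)
                state = parents[state]
            return path[::-1]
        for nxt in neighbors(state):
            if nxt not in parents:
                parents[nxt] = state
                queue.append(nxt)
    return None

# Hypothetical toy puzzle: get from 1 to 10 using "+1" or "*2".
path = min_moves(1, lambda n: n == 10, lambda n: (n + 1, n * 2))
print(len(path) - 1, "moves:", path)
```

The same function works for any puzzle whose states are hashable and whose legal moves can be enumerated by `neighbors`.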

5Kaj_Sotala4d

I'd say that anything that gives the right result counts as a success.

[RETRACTED] It's time for EA leadership to pull the short-timelines fire alarm.

jeff87652mo40

Looking back at the parameters of the bet, it's interesting to me that the benchmark and math components have all fallen, but that the two "real world" components of the bet are still standing.

The First Room-Temperature Ambient-Pressure Superconductor

jeff87652y30

It also seems a patent was filed for this material in 2021 and was granted earlier this year prior to publication.

1sanxiyn2y

The patent is not yet granted.

GPT-4

jeff87652y30

Notably, if you tell it to think step by step it gets the question right and answers with:

This problem is known as the Monty Hall problem. In the original problem, the car is placed randomly behind one of the three doors, and the host always opens a door with a goat after you make your initial choice. However, in your variation, the car is always behind door number 1.

Let's analyze your version step by step:

You initially choose door No. 1, which always has the car behind it.

The host, knowing what's behind the doors, will always open one of the other two doo... (read more)

GPT-4

jeff87652y20

I've found its ability to be much better as well. In contrast to GPT-3, which often seemed to be unable to keep track of board state and made illegal moves toward the end of the game, it not only played legal moves, it actually mated me. Granted, I'm a terrible player and I was deliberately not reading ahead to see if it would be able to mate a weak player. My method was to tell it I wanted to play and then give my move in algebraic notation. It would respond with a move, then I would respond with another. After it beat me, I asked it to list all the moves... (read more)

To what extent is GPT-3 capable of reasoning?

Answer by jeff8765Feb 27, 202340

I recently got access to Bing and asked it about the bullet in temporary gravity of varying duration. It does quite a bit better than GPT-3, though it's very verbose. It does do a search during its answer, but only to find the typical initial velocity of a bullet. It makes an error regarding the final velocity of the bullet after three seconds, but correctly determines that the bullet will go up forever if gravity lasts three seconds but will fall back to Earth if it lasts five minutes. Bold is me, everything else is Bing.

Okay, I’ve cleared the slate ... (read more)
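As a rough check on the two conclusions described above, here is a minimal sketch of the kinematics. It assumes the bullet is fired straight up, ignores air resistance, treats gravity as a constant 9.81 m/s² until it switches off, and uses an illustrative muzzle velocity of 900 m/s; none of these figures come from the Bing transcript, and the function name `bullet_fate` is made up for the example.

```python
G = 9.81                  # m/s^2, assumed constant while gravity is on
MUZZLE_VELOCITY = 900.0   # m/s, illustrative assumption

def bullet_fate(gravity_duration_s, v0=MUZZLE_VELOCITY, g=G):
    """Classify a bullet fired straight up if gravity ends after a delay."""
    time_to_land = 2 * v0 / g            # full up-and-down flight time
    if gravity_duration_s >= time_to_land:
        return "falls back to Earth"     # lands while gravity is still on
    velocity_at_cutoff = v0 - g * gravity_duration_s
    if velocity_at_cutoff > 0:
        return "rises forever"           # still climbing, nothing slows it
    if velocity_at_cutoff == 0:
        return "hangs at its peak"       # cutoff exactly at the apex
    return "falls back to Earth"         # already descending, keeps going

for duration in (3, 300):                # three seconds, five minutes
    print(duration, "s:", bullet_fate(duration))
```

Under these assumptions the round trip takes about three minutes, so a three-second window leaves the bullet still climbing while a five-minute window lets it land.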

What 2026 looks like

jeff87652y40

It would cause a severe heat dissipation problem. All that energy is going to be radiated as waste heat and, in equilibrium, will be radiated as fast as it comes in. The temperature required to radiate at the requisite power level would be in excess of the temperature at the surface of the sun. Any harvesting machinery on the surface of the planet would melt unless it were built from something unknown to modern chemistry.

2Daniel Kokotajlo2y

Seems like a good point. I'd be interested to hear what the authors have to say about that.

College Admissions as a Brutal One-Shot Game

jeff87652y10

In my particular case it wasn't really all that hard. I went to an extremely small school so classes weren't tracked the way they might be at a larger school. Since I was much better at taking tests than my peers I didn't really have to study to get A's on tests. We didn't even have all that much homework, though I guess it probably was hundreds of hours over the course of my high school career. I would have had to do that regardless though.

College Admissions as a Brutal One-Shot Game

jeff87652y129

For me the answer is yes, but my situation is quite non-central. I got into MIT since I was a kid from a small rural town with really good grades, really good test scores, and was on a bunch of sports teams. Because I was from a small rural town and was pretty smart, none of this required special effort other than being on sports teams (note: being on the teams required no special skill, as everyone who tried out made the team given the small class size). The above was enough to get me an admission, probably for reasons of diversity. I'm a white man but I'm fairl... (read more)

2cata2y

You say that, but getting really good grades in high school sounds like thousands of hours of grunt work, with very marginal benefit outside college admissions. Maybe it's what you would have done anyway, but I don't think it's what most teenagers would prefer to be doing.

Google's new 540 billion parameter language model

jeff87653y40

But this doesn’t solve the problem of angry customers and media the way firing a misbehaving employee would. Though I suppose this is more an issue of friction/aversion to change than an actual capabilities issue.

What Would A Fight Between Humanity And AGI Look Like?

jeff87653y10

Yeah, but Putin’s been president of Russia for over 20 years and already has a very large, loyal following. There will always be those that enthusiastically follow the party line of the leader. It’s somewhat harder to actually seize power. (None of this is to excuse the actions of Putin or those who support him.)

Google's new 540 billion parameter language model

jeff87653y40

Likely higher than one in a million, but they can be fired after a failure to allow the company to save face. Harder to do that with a $50M language model.

3FeepingCreature3y

Just delete the context window and tweak the prompt.

Google's new 540 billion parameter language model

jeff87653y80

I think the issue here is that the tasks in question don't fully capture everything we care about in terms of language facility. I think this is largely because even very low probabilities of catastrophic actions can preclude deployment in an economically useful way.

For example, a prime use of a language model would be to replace customer service representatives. However, if there is even a one in a million chance that your model will start cursing out a customer, offer a customer a million dollars to remedy an error, or start spewing racial epithets, the model cannot be usefully deployed in such a fashion. None of the metrics in the paper can guarantee, or even suggest, that level of consistency.

5FeepingCreature3y

I wonder what the failure probability is for human customer service employees.

We Live in a Post-Scarcity Society

jeff87653y30

One small quibble: you can actually live much more cheaply on rice. A pound of dry rice contains 1600 calories. If you eat 2000 calories a day, you need 5 pounds every 4 days, so a 50 pound bag will last 40 days, meaning you need about 9 bags per year. This has a total cost of about $450 at your price. Probably less if you shop around or buy in bulk.
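The arithmetic above can be reproduced with a short sketch. The calorie and bag figures are the ones in the comment; the $50 per bag price is an assumption inferred from "$450 for 9 bags", and the function name `rice_budget` is made up for the example.

```python
import math

CALORIES_PER_POUND = 1600   # dry rice, figure from the comment
DAILY_CALORIES = 2000       # figure from the comment
BAG_POUNDS = 50             # figure from the comment
PRICE_PER_BAG = 50.0        # assumed: implied by $450 for 9 bags

def rice_budget(days=365):
    """Work out how much rice a year of eating takes, and what it costs."""
    pounds_per_day = DAILY_CALORIES / CALORIES_PER_POUND
    days_per_bag = BAG_POUNDS / pounds_per_day
    bags_exact = days / days_per_bag
    return {
        "pounds_per_day": pounds_per_day,
        "days_per_bag": days_per_bag,
        "bags_exact": bags_exact,
        "bags_whole": math.ceil(bags_exact),   # you cannot buy a fraction
        "cost_exact": bags_exact * PRICE_PER_BAG,
    }

for name, value in rice_budget().items():
    print(f"{name}: {value}")
```

The unrounded result is slightly more than 9 bags, so buying whole bags means a tenth bag late in the year; the rounded figure in the comment is close enough for the point being made.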

3lsusr3y

Thanks. In my calculations, I used 600 Calories per pound of rice, which is wrong. Edit: Fixed the original post.

The funnel of human experience

jeff87657y290

I think you added an extra three zeros during your total year calculations. You list 2.23E15 as the total number of years experienced, but multiplying the total time of 5E4 years by the current population of 8E9 gives a total of only 4E14 experience years. The true number must be quite a bit lower, as the human population was quite a bit lower than 8 billion for most of that time. This also affects the proportion of experience years which have occurred in living memory. My guess is 20% have occurred since the birth of Kane Tanaka and 10% experienced by living peo... (read more)
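The sanity check in this comment is a one-line upper bound, sketched here with the three figures quoted above. The variable names are made up for the example; the bound assumes, as the comment does, that the population was never larger than it is today.

```python
YEARS_OF_HISTORY = 5e4      # total time span quoted in the comment
CURRENT_POPULATION = 8e9    # current population quoted in the comment
LISTED_TOTAL = 2.23e15      # experience-years figure listed in the post

# If everyone alive today had lived through the whole span, the total
# could not exceed this product; the real total must be smaller.
upper_bound = YEARS_OF_HISTORY * CURRENT_POPULATION
overshoot = LISTED_TOTAL / upper_bound

print(f"upper bound: {upper_bound:.2e} experience-years")
print(f"listed total is {overshoot:.1f}x the upper bound")
```

Because the listed figure exceeds even this generous ceiling, it cannot be right, whatever the exact size of the slip.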

eukaryote7y210

You are super right and that is exactly what happened - I checked the numbers and had made the total three orders of magnitude too large. Thanks for the sanity checks and catch. It turns out this moves the midpoint up to 1432. Lemme fix the other numbers as well.

Update: Actually, it did nothing to the midpoint, which makes sense in retrospect (maybe?) but does change the "fraction of time" thing, as well as some of the Fermi estimates in the middle.
15% of experience has actually been experienced by living people, and 28% since Kane Tanaka's birth. I've updated this here and on my blog.