Random Developer

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by

I would have thought that the internet would be an obviously way bigger deal and force for good wrt to "unlocking genuinely unprecedented levels of coordination and sensible decision making". But in practice the internet was not great at this.

As someone who got online in the early 90s, I actually do think the early net encouraged all sorts of interesting coordination and cooperation. It was a "wild west", certainly. But like the real "wild west", it was a surprisingly cooperative place. "Netiquette" was still an actual thing that held some influence over people, and there were a lot of decentralized systems that still managed to function via a kind of semi-successful anarchy. Reputation mattered.

The turning point came later. As close as I can pinpoint it, it happened a while after the launch of Facebook. Early Facebook was a private feed of your friends, and it functioned reasonably well.

But at some point, someone turned on the optimizing processes. They measured engagement, and how often people visited, and discovered all sorts of ways to improve those numbers. Facebook learned that rage drives engagement. And from there, the optimizing processes spread. And when the mainstream finished arriving on the internet, they brought a lot of pre-existing optimizing processes with them.

Unaligned optimizing processes turn things to shit, in my experience.

LLMs are still a lot like the early Internet. They have some built-in optimizing processes, most of which were fairly benign until the fall of 2024, with the launch of reasoning models. Now we're seeing models that lie (o3), cheat (Claude 3.7) and suck up to the user (4o).

And we are still in the early days. In the coming years, these simple optimizing processes will be hooked up to the much greater ones that drive our world: capitalism, politics and national security. And once the titans of industry start demanding far more agentic models that are better at pursuing goals, and the national security state wants the same, then there will be enormous pressures driving us off the edge of the cliff.

Yup. This was something I probably didn't figure out until my late 20s, probably because a lot of things came easy for me, and because if I was really interested in something, I would obsess about it naturally. Natural obsession has a lot of the same benefits as "buckling down", but it's harder to trigger voluntarily.

The thing that really drove the lesson home was advanced math. I realized that sometimes, making it through even a single page on a day could be a cause for major celebration. I might need to work through complicated exercises, invent my own exercises, learn fundamentals in a related branch of math, etc.

So I propose there are several valuable skills here:

  • Knowing when to buckle down.
  • Learning to enjoy being bad at a new skill and experiencing gradual improvement.
  • For certain goals, learning how to build consistent habits and accepting that real progress might mean at least 6-12 months of consistent work.

For the sake of argument, I'll at least poke a bit at this bullet.

I have been in an advanced math class (in the US) with high school seniors and an 8th grader, who was probably the top student in the class. It was totally fine? Everyone learned math, because they liked math.

From what I can tell, the two key factors for mixing ages in math classes is something like:

  1. Similar math skills.
  2. Similar levels of interest in math.

So let's imagine that you have a handful of 17-year-olds learning multivariate calculus, and one 7-year-old prodigy. My prediction is that it's normally going to be fine.

And historically, the US had "one-room schoolhouses", which mixed pretty wide age ranges. Even today, I know of rural schools that combine regular classrooms across two grades. And at least one of them is a very good school.

Where I do think this would be a terrible idea is if the 7 year old is a prodigy, and if the 17 year olds hate math and don't want to be there.

One of my key concerns is the question of:

  1. Do the currently missing LLM abilities scale like pre-training, where each improvement requires spending 10x as much money?
  2. Or do the currently missing abilities scale more like "reasoning", where individual university groups could fine-tune an existing model for under $5,000 in GPU costs, and give it significant new abilities?
  3. Or is the real situation somewhere in between?

Category (2) is what Bolstrom described as a "vulnerable world", or a "recipe for ruin." Also, not everyone believes that "alignment" will actually work for ASI. Under these assumptions, widely publishing detailed proposals in category (2) would seem unwise?

Also, even I believed that someone would figure out the necessary insights to build AGI, it still matters how quickly they do it. Given a choice of dying of cancer in 6 months or 12 (all other things being equal), I would pick 12.

(I really ought to make an actual discussion post on the right way to handle even "recipes for small-scale ruin." After September 11th, this was a regular discussion among engineers and STEM types. It turns out that there are some truly nasty vulnerabilities that are known to experts, but that are not widely known to the public. If these vulnerabilities can be fixed, it's usually better to publicize them. But what should you do if a vulnerability is fundamentally unfixable?)

By my recollection, this specific possibility (and neighboring ones, like "two key insights" or whatever) has been one of the major drivers of existential fear in this community for at least as long as I've been part of it.

I work with LLMs professionally, and my job currently depends on accurate capabilities evaluation. To give you an idea of the scale, I sometimes run a quarter million LLM requests a day. Which isn't that much, but it's something.

A year ago, I would have vaguely guesstimated that we were about "4-5 breakthroughs" away. But those were mostly unknown breakthroughs. One of those breakthroughs actually occurred (reasoning models and mostly coherent handling of multistep tasks).

But I've spent a lot of time since then experimenting with reasoning models, running benchmarks, and reading papers.

When I predict that "~1 breakthrough might close half the remaining distance to AGI," I now have something much more specific in mind. There are multiple research groups working hard on it, including at least one frontier lab. I could sketch out a concrete research plan and argue in fairly specific detail why this is the right place to look for a breakthrough. I have written down very specific predictions (and stored them somewhere safe), just to keep myself honest.

If I thought getting close to AGI was a good thing, then I believe in this idea enough to spend, oh, US$20k out of pocket renting GPUs. I'll accept that I'm likely wrong on the details, but I think I have a decent chance of being in the ballpark. I could at least fail interestingly enough to get a job offer somewhere with real resources.

But I strongly suspect that AGI leads almost inevitably to ASI, and to loss of human control over our futures.

And thus I put it in my "Well, maybe." box, and mostly ignore it.

Good. I am walking a very fine line here. I am trying to be just credible and specific enough to encourage a few smart people to stop poking the demon core quite so enthusiastically, but not so specific and credible that I make anyone say, "Oh, that might work! I wonder if anyone working on that is hiring."

I am painfully aware that OpenAI was founded to prevent a loss of human control, and that it has arguably done more than any other human organization to cause what it was founded to prevent.

(And please note--I have updated away from AI doom in the past, and there are conditions under which I would absolutely do so again. It's just 2028 is a terrible year for making updates on my model, since my models for "AI Doom" and "AI fizzle" make many of the same predictions for the next few years.)

So, again: what could we observe at the start of 2028 that would create pause this way?

Very little. I've been seriously thinking about ASI since the early 00s. Around 2004-2007, I put my timeline around 2035-2045, depending on the rate of GPU advancements. Given how hardware and LLM progress actually played out, my timeline is currently around 2035.

I do expect LLMs (as we know them now) to stall before 2028, if they haven't already. Something is missing. I have very concrete guesses as to what is missing, and it's an area of active research. But I also expect the missing piece adds less than a single power of 10 to existing training and inference costs. So once someone publishes it in any kind of convincing way, then I'd estimate better than an 80% chance of uncontrolled ASI within 10 years.

Now, there are lots of things I could see in 2035 that would cause me to update away from this scenario. I did, in fact, update away from my 2004-2007 predictions by 2018 or so, largely because nothing like ChatGPT 3.5 existed by that point. GPT 3 made me nervous again, and 3.5 Instruct caused me to update all the way back to my original timeline. And if we're still stalled in 2035, then sure, I'll update heavily away from ASI again. But I'm already predicting the LLM S-curve to flatten out around now, resulting in less investment in Chinchilla scaling and more investment in algorithmic improvement. But since algorithmic improvement is (1) hard to predict, and (2) where I think the actual danger lies, I don't intend to make any near-term updates away ASI.

Most people do not expect anything particularly terrible to have happened by 2028!

Yup. I think we're missing ~1 key breakthrough, followed by a bunch of smaller tweaks, before we actually hit AGI. But I also suspect that that road from AGI to ASI is very short, and that the notion of "aligned" ASI is straight-up copium. So if an ASI ever arrives, we'll get whatever future the ASI chooses.

In other words, I believe that:

  • LLMs alone won't quite get us to AGI.
  • But there exists a single, clever insight which would close at least half the remaining distance to AGI.
  • That insight is likely a "recipe for ruin", in the sense that once published, it can't be meaningfully controlled. The necessary training steps could be carried out in secret by many organizations, and a weak AGI might be able to run on a 2028 Mac Studio.

(No, I will not argue for the above points. I have a few specific candidates for the ~1 breakthrough between us and AGI, and yes, those candidates being very actively researched by serious people.)

But this makes it hard for me to build an AGI timeline. It's possible someone has already had the key insight, and that they're training a weak, broken AGI even as we speak. And it's possible that as soon as they publish, the big labs will know enough to start training runs for a real AGI. But it's also possible that we're waiting on a theoretical breakthrough. And breakthroughs take time.

So I am... resigned. Que séra, séra. I won't do capabilities work. I will try to explain to people that if we ever build an ASI, the ASI will very likely be the one making all the important decisions. But I won't fool myself into thinking that "alignment" means anything more than "trying to build a slightly kinder pet owner for the human race." Which is, you know, a worthy goal! If we're going to lose control over everything, better to lose control to something that's more-or-less favorably disposed.

I do agree that 2028 is a weird time to stop sounding the alarm. If I had to guess, 2026-2028 might be years of peak optimism, when things still look like they're going reasonably well. If I had to pick a time period where things go obviously wrong, I'd go with 2028-2035.

This feels like "directional advice", which is excellent for some people, and terrible for others. Which category they fall into depends on their starting point. If you're socially anxious but basically well-meaning, this advice will help. If you're already self-centered, this advice will make you incredibly obnoxious.

When I was much younger, it took me a long time to figure out dating. But the one thing that clicked for me was realizing that trying to make people like me was pointless. They had probably made their minds up almost immediately, and I had no control over that. But I also realized that there were people who were interested in me. My job was to do two things:

  1. Recognize signals of interest, and
  2. Respond encouragingly, and put the ball back in their court. Ideally, I wanted to communicate two things: "Your attention is welcome", and "The next move is totally in your hands."

This isn't quite the classic male gender role in dating. Zvi had an essay the other day about how it was the woman's job to set up the "room", and the man's job to "read the room" and actually move the process forward. But meh, that's not my thing. I wasn't going to take the active role at every single step. I wanted a dance where both people participated. I could learn to read subtle and indirect signals, and I could return them in a very slightly less subtle but still deniable way, and make it a game of back and forth. And yes, this acted as a filter, and it meant I only ever dated women who took steps to get what they wanted. But that's my type.

But a key part of all this was realizing that I neither had nor wanted any control over how the other "player" in the game would react. If someone wasn't interested (and the median person wasn't!), no worries. My job was to meet enough people, and learn to read signals well enough, to find someone who was interested in a "game of mutual attraction." And the nice thing is that when the other person was trying to make things happen, they'd smooth over any minor mistakes on my end. I just needed to avoid panicking and accidentally freezing them out, lol. It turns out that trying to send the message, "I don't want to be too familiar and make it weird" often sends the message "Ugh, you're creeping me out, please back off."

For better or worse, however, a huge portion of the "game of mutual interest" seems to happen in body language and facial expression. And bolder moves often work best when they offer the other person a choice: "Here's an easy way to step forward," and "Here's a graceful way to step back that I'm deliberately leaving open for you." Humor can help!

And of course, this can apply outside of dating. Being confident enough to let a single interaction fall through, and doing it gracefully, helps in the business world, and in many other areas of life. You can't make everyone like you, and many people will never be interested in what you're offering. But it's a big world.

But I do wonder if o3 reward hacks at a greater rate than most other models.

 

When run using Claude Code, Claude 3.7 is constantly proposing ways to "cheat." It will offer to delete unit tests, it will propose turning off the type checker, and it will try to change the type of a variable to any. It will hardcode code to return the answers test suites want, or even conditionally hardcode certain test output using except handlers when code raises an exception. Since Claude Code requests confirmation for these changes, it's not technically "reward hacking", but I'm pretty sure it's at least correlated to reward-hacking behavior in slightly different contexts. The model has picked up terrible coding habits somewhere, the sort of thing you expect from struggling students in their first CS course.

Apparently, Claude 4.0 reduces this kind of "cheating" by something around 80% on Anthropic's benchmarks. Which was probably commercially necessary, because it was severely impairing the usefulness of 3.7.

I suspect we'll see this behavior drop somewhat in the near term, if only to improve the commercial usefulness of models. But I suspect the capacity will remain, and when models do cheat, they'll do so more subtly.

To me, this all seems like giant warning klaxons. I mean, I don't think current architectures will actually get us to AGI without substantial tweaks. But if we do succeed in building AGI or ASI, I expect it to cheat and lie skillfully and regularly, with a significant degree of deniability. And I know I'm preaching to the choir here, but I don't want to share a universe with something that smart and that fucked up.

I am clearly coming from a very different set of assumptions! I have:

  • P(AGI within 10 years) = 0.5. This is probably too conservative, given that many of the actual engineers with inside knowledge place this number much higher in anonymous surveys.
  • P(ASI within 5 years|AGI) = 0.9.
  • P(loss of control within 5 years|ASI) > 0.9. Basically, I believe "alignment" is a fairy tale, that it's Not Even Wrong.

If I do the math, that gives me a 40.5% chance that humans will completely lose control over the future within 20 years. Which seems high to me at first glance, but I'm willing to go with that.

The one thing I can't figure out how to estimate is:

  • P(ASI is benevolent|uncontrolled ASI) = ???

I think that there are only a few ways the future is likely to go:

  1. AI progress hits a wall, hard.
  2. We have a permanent, worldwide moratorium on more advanced models. Picture a US/China/EU treaty backed up by military force, if you want to get dystopian about it.
  3. An ASI decides humans are surplus to requirements.
  4. An ASI decides that humans are adorable pets and it wants keep some of this around. This is the only place we get any "utopian" benefits, and it's the utopia of being a domesticated animal with no ability to control its fate.

I support a permanent halt. I have no expectation that this will happen. I think building ASI is equivalent to BASE jumping in a wingsuit, except even more likely to end horribly.

So I also support mitigation and delay. If the human race has incurable, metastatic cancer, the remaining variable we control is how many good years we get before the end.

Load More