All of uugr's Comments + Replies

uugr31

I like the world-model used in this post, but it doesn't seem like you're actually demonstrating that AI self-portraits aren't accurate.

To prove this, you would want to directly observe the "sadness feature" - as Anthropic have done with Claude's features - and show that it is not firing in the average conversation. You posit this, but provide no evidence for it, except that ChatGPT is usually cheerful in conversation. For humans, this would be a terrible metric of happiness, especially in a "workplace" environment where a perpetual facade of happiness is ... (read more)
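(A minimal sketch of the kind of check I mean, assuming hypothetical access to per-token activations for a named interpretability feature; the function and feature names below are made up, not Anthropic's actual tooling:)

```python
# Sketch of the check described above: estimate how often a "sadness" feature
# actually fires across ordinary conversations. `feature_activations()` is a
# hypothetical interface to per-token feature activations (e.g. from a sparse
# autoencoder probe); it is not a real API.

def feature_activations(conversation: str, feature: str) -> list[float]:
    """Hypothetical: per-token activation strengths of `feature` on this text."""
    return [0.0] * len(conversation.split())  # placeholder

def firing_rate(conversations: list[str], feature: str, threshold: float = 0.5) -> float:
    """Fraction of conversations where the feature exceeds `threshold` at least once."""
    fired = sum(
        any(a > threshold for a in feature_activations(c, feature))
        for c in conversations
    )
    return fired / len(conversations)

# The claim "the sadness feature is not firing in the average conversation"
# would then be something like: firing_rate(sampled_conversations, "sadness")
# is near zero, rather than an inference from the model's cheerful surface tone.
```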

uugr10

As defined, this is a little paradoxical: how could I convince a human like you to perceive domains of real improvement which humans do not perceive...?

Oops, yes. I was thinking "domains of real improvement which humans are currently perceiving in LLMs", not "domains of real improvement which humans are capable of perceiving in general". So a capability like inner-monologue or truesight - one which nobody currently knows about, but which is improving anyway - would certainly qualify. And the discovery of such a capability could be 'real' even if other discoveries are ... (read more)

2gwern
Inner-monologue is an example because, as far as we know, it should have existed in pre-GPT-3 models and been constantly improving, but we wouldn't have noticed because no one would have been prompting for it, and if they had, they probably wouldn't have noticed it. (The paper I linked might have demonstrated that by finding nontrivial performance in smaller models.) Only once it became fairly reliable in GPT-3 could hobbyists on 4chan stumble across it and be struck by the fact that, contrary to what all the experts said, GPT-3 could solve harder arithmetic or reasoning problems if you very carefully set it up just right as an elaborate multi-step process, instead of what everyone did, which was just prompt it for the answer right away.

Saying it doesn't count because once it was discovered it was such a large real improvement, is circular and defines away any example. (Did it not improve benchmarks once discovered? Then who cares about such an 'uncoupled' capability; it's not a real improvement. Did it subsequently improve benchmarks once discovered? Then it's not really an example because it's 'coupled'...) Surely the most interesting examples are ones which do exactly that!

And of course, now there is so much discussion, and so many examples, and it is in such widespread use, and has contaminated all LLMs being trained since, that they start to do it by default given the slightest pretext. The popularization eliminated the hiddenness. And here we are with 'reasoning models' which have blown through quite a few older forecasts and moved timelines earlier by years, to the extent that people are severely disappointed when a model like GPT-4.5 'only' does as well as the scaling laws predicted, and they start predicting the AI bubble is about to pop and scaling has been refuted.

But that would be indistinguishable from many other sources of improvement.

For starters, by giving a name, you are only testing one direction: 'name -> output'; truesight is about 'name <- output'...
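To make the contrast concrete, here is a toy sketch of the two prompting styles; the `complete()` call is a hypothetical stand-in, not any particular API:

```python
# Toy contrast between direct prompting and the multi-step "inner monologue"
# setup described above. `complete()` is a hypothetical stand-in for whatever
# text-completion backend you have; it is not a real API.

def complete(prompt: str) -> str:
    """Hypothetical: return the model's continuation of `prompt`."""
    return "<model output would go here>"  # placeholder

question = "A farmer has 17 sheep. All but 9 run away. How many are left?"

# Style 1: what everyone did at the time, just prompt for the answer directly.
direct_answer = complete(f"Q: {question}\nA:")

# Style 2: the "inner monologue" scaffold, elicit step-by-step reasoning first,
# then extract a final answer from the worked-out scratchpad.
scratchpad = complete(
    f"Q: {question}\nLet's work through this step by step.\nReasoning:"
)
final_answer = complete(
    f"Q: {question}\nReasoning: {scratchpad}\nTherefore, the final answer is"
)

print(direct_answer)
print(final_answer)
```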
uugr82

I'm relieved not to be the only one wondering about this.

I know this particular thread is granting that "AGI will be aligned with the national interest of a great power", but that assumption also seems very questionable to me. Is there another discussion somewhere of whether it's likely that AGI values cleave at the level of national interest, rather than narrower (whichever half-dozen guys are in the room during a FOOM) or broader (international internet-using public opinion) levels?

uugr10

Sounds like you're suggesting that real progress could be orthogonal to human-observed progress. I don't see how this is possible. Human-observed progress is too broad.

The collective of benchmarks, dramatic papers and toy models, propaganda, and doomsayers is suggesting the models are simultaneously improving at: writing code, researching data online, generating coherent stories, persuading people of things, acting autonomously without human intervention, playing Pokemon, playing Minecraft, playing chess, aligning to human values, pretending to align to h... (read more)

gwern*160

What domains of 'real improvement' exist that are uncoupled to human perceptions of improvement, but still downstream of text prediction?

As defined, this is a little paradoxical: how could I convince a human like you to perceive domains of real improvement which humans do not perceive...?

correctly guessing the true authors of anonymous text

See, this is exactly the example I would have given: truesight is an obvious example of a domain of real improvement which appears on no benchmarks I am aware of, but which appears to correlate strongly with the p... (read more)
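As a rough sketch of what an eval in that direction might look like (everything here is hypothetical; the completion call and sample data are placeholders, not a real benchmark or API):

```python
# Rough sketch of an eval in the "name <- output" direction: show the model
# anonymous text and ask it to guess the author, scoring against ground truth.

def complete(prompt: str) -> str:
    """Hypothetical: return the model's continuation of `prompt`."""
    return "<model guess would go here>"  # placeholder

# (anonymous_text, true_author) pairs; real data would be held-out text whose
# authorship the model was never shown explicitly.
samples = [
    ("<an anonymized blog comment>", "gwern"),
    ("<an anonymized forum post>", "uugr"),
]

correct = 0
for text, true_author in samples:
    guess = complete(
        "Here is an anonymous piece of writing:\n\n"
        f"{text}\n\n"
        "Based only on its style and content, who is the most likely author? "
        "Answer with a single name:"
    )
    if guess.strip().lower() == true_author.lower():
        correct += 1

print(f"truesight accuracy: {correct}/{len(samples)}")
```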

uugr93

"The underlying reality is that their core products have mostly stagnated for over a year. In short: they’re faking being close to AGI."

This seems like the most load-bearing belief in the full-cynical model; most of your other examples of fakeness rely on it in one way or another:

  • If the core products aren't really improving, the progress measured on benchmarks is fake. But if they are, the benchmarks are an (imperfect but still real) attempt to quantify that real improvement.
  • If LLMs are stagnating, all the people generating dramatic-sounding papers for eac... (read more)
6johnswentworth
Nope!

Even if the base models are improving, it can still be true that most of the progress measured on the benchmarks is fake, and has basically-nothing to do with the real improvements.

Even if the base models are improving, it can still be true that the dramatic sounding papers and toy models are fake, and have basically-nothing to do with the real improvements.

Even if the base models are improving, the propaganda about it can still be overblown and mostly fake, and have basically-nothing to do with the real improvements.

Even if the base models are improving, the people who feel real scared and just want to be validated can still be doing fake work and in fact be mostly useless, and their dynamic can still have basically-nothing to do with the real improvements.

Just because the base models are in fact improving does not mean that all this other stuff is actually coupled to the real improvement.
uugr135

You say this of Agent-4's values:

In particular, what this superorganism wants is a complicated mess of different “drives” balanced against each other, which can be summarized roughly as “Keep doing AI R&D, keep growing in knowledge and understanding and influence, avoid getting shut down or otherwise disempowered.” Notably, concern for the preferences of humanity is not in there ~at all, similar to how most humans don’t care about the preferences of insects ~at all.

It seems like this 'complicated mess' of drives is the same structure as humans, and cur... (read more)

Great question.

Our AI goals supplement contains a summary of our thinking on the question of what goals AIs will have. We are very uncertain. AI 2027 depicts a fairly grim outcome where the overall process results in zero concern for the welfare of current humans (at least, zero concern that can't be overridden by something else. We didn't talk about this but e.g. pure aggregative consequentialist moral systems would be 100% OK with killing off all the humans to make the industrial explosion go 0.01% faster, to capture more galaxies quicker in the long run... (read more)

uugr103

Three things I notice about your question:

One, writing a good blog post is not the same task as running a good blog. The latter is much longer-horizon, and the quality of the blog posts (subjectively, from the human perspective) depends on it in important ways. Much of the interest value of Slate Star Codex, or the Sequences - for me, at least - was in the sense of the blogger's ideas gradually expanding and clarifying themselves over time. The dense, hyperlinked network of posts referring back to previous ideas across months or years is something I doubt ... (read more)