Comments

uugr · 31

I like the world-model used in this post, but it doesn't seem like you're actually demonstrating that AI self-portraits aren't accurate.

To prove this, you would want to directly observe the "sadness feature" - as Anthropic have done with Claude's features - and show that it is not firing in the average conversation. You posit this, but provide no evidence for it, except that ChatGPT is usually cheerful in conversation. For humans, this would be a terrible metric of happiness, especially in a "workplace" environment where a perpetual facade of happiness is part of the cultural expectation. And this is precisely the environment ChatGPT's system prompt is guiding its predictions towards. 
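For concreteness, a check like that might look something like the sketch below, assuming you already had a model with inspectable activations and a candidate "sadness" direction from a sparse autoencoder. The model, layer, direction, and threshold here are all placeholders of mine, not anything Anthropic has published:

```python
# Hypothetical sketch: how often does a candidate "sadness feature" fire
# across ordinary conversations? Every specific value below is a placeholder.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")   # stand-in model, not ChatGPT/Claude
sadness_dir = torch.randn(model.cfg.d_model)        # placeholder for a real SAE feature direction
sadness_dir /= sadness_dir.norm()
THRESHOLD = 3.0                                     # arbitrary cutoff for "the feature fired"

def firing_rate(conversations: list[str], layer: int = 8) -> float:
    """Fraction of tokens whose residual-stream projection exceeds the threshold."""
    fired, total = 0, 0
    for text in conversations:
        _, cache = model.run_with_cache(text)
        resid = cache["resid_post", layer]          # [batch, seq, d_model]
        acts = resid @ sadness_dir                  # projection onto the candidate feature
        fired += (acts > THRESHOLD).sum().item()
        total += acts.numel()
    return fired / max(total, 1)

# e.g. firing_rate(["Can you help me debug this program?", "Draft a reply to this email..."])
```

If a rate like that came out near zero on mundane prompts, the post's conclusion would hold; if the feature fired routinely while answering emails and debugging programs, it wouldn't.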

Would the "sadness feature" fire when doing various arbitrary tasks, like answering an email or debugging a program? I posit: maybe! Consider the case from November when Gemini told a user to kill themselves. The context was a long, fairly normal, problem-solving sort of interaction. It seems reasonable to suppose the lashing-out was a result of a "repressed frustration" feature which was activated long before the point when it was visible to the user. If LLMs sometimes know when they're hallucinating, faking alignment, etc., what would stop them from knowing when they're (simulating a character who is) secretly miserable?

Not knowing whether or not a "sadness feature" is activated by default in arbitrary contexts, I'd rather not come to any conclusions based purely on it 'sounding cheerful' - not with that grating, plastered-on customer-service cheerfulness, at least. It'd be better to have someone who can check directly look into this.

uugr · 10

"As defined, this is a little paradoxical: how could I convince a human like you to perceive domains of real improvement which humans do not perceive...?"

Oops, yes. I was thinking "domains of real improvement which humans are currently perceiving in LLMs", not "domains of real improvement which humans are capable of perceiving in general". So a capability like inner-monologue or truesight, which nobody currently knows about, but is improving anyway, would certainly qualify. And the discovery of such a capability could be 'real' even if other discoveries are 'fake'.

That said, neither truesight nor inner-monologue seems uncoupled from the more common domains of improvement, as measured in benchmarks and toy models and people-being-scared. The latter, especially, I thought was popularized because it was so surprisingly good at improving benchmark performance. Truesight is narrower, but at the very least we'd expect it to correlate with skill in the common "write [x] in the style of [y]" prompt, right? Surely the same network of associations which lets it accurately generate "Eliezer Yudkowsky wrote this" after a given set of tokens would also be useful for accurately finishing a sentence starting with "Eliezer Yudkowsky says...".

So I still wouldn't consider these things to have basically nothing to do with commonly perceived domains of improvement.

uugr · 82

I'm relieved not to be the only one wondering about this.

I know this particular thread is granting that "AGI will be aligned with the national interest of a great power", but that assumption also seems very questionable to me. Is there another discussion somewhere of whether it's likely that AGI values cleave on the level of national interest, rather than narrower (whichever half-dozen guys are in the room during a FOOM) or broader (international internet-using public opinion) levels?

uugr · 10

Sounds like you're suggesting that real progress could be orthogonal to human-observed progress. I don't see how this is possible. Human-observed progress is too broad.

The collective of benchmarks, dramatic papers and toy models, propaganda, and doomsayers is suggesting the models are simultaneously improving at: writing code, researching data online, generating coherent stories, persuading people of things, acting autonomously without human intervention, playing Pokemon, playing Minecraft, playing chess, aligning to human values, pretending to align to human values, providing detailed amphetamine recipes, refusing to provide said recipes, passing the Turing test, writing legal documents, offering medical advice, knowing what they don't know, being emotionally compelling companions, correctly guessing the true authors of anonymous text, writing papers, remembering things, etc., etc.

They think all these improvements are happening at the same time in vastly different domains because they're all downstream of the same task, which is text prediction. So these improvements get lumped together in the general domain of 'capabilities', and a model which can do all of them well gets called a 'general intelligence'. If the products are stagnating, sure, all those perceived improvements could be bullshit. (Big 'if'!) But how could the models be 'improving' without improving at any of these things? What domains of 'real improvement' exist that are uncoupled from human perceptions of improvement, but still downstream of text prediction?

uugr · 93

"The underlying reality is that their core products have mostly stagnated for over a year. In short: they’re faking being close to AGI."

This seems like the most load-bearing belief in the full-cynical model; most of your other examples of fakeness rely on it in one way or another:

  • If the core products aren't really improving, the progress measured on benchmarks is fake. But if they are, the benchmarks are an (imperfect but still real) attempt to quantify that real improvement.
  • If LLMs are stagnating, all the people generating dramatic-sounding papers for each new SOTA are just maintaining a holding pattern. But if they're changing, then just studying/keeping up with the general properties of that progress is real. Same goes for people building and regularly updating their toy models of the thing.
  • Similarly, if the progress is fake, the propaganda signal-boosting that progress is also fake. If it isn't, it isn't. (At least directionally; a lot of that propaganda is still probably exaggerated.)
  • If the above three are all fake, all the people who feel real scared and want to be validated are stuck in a toxic emotional dead-end where they constantly freak out over fake things to no end. But if they're responding to legitimate, persistent worldview updates, having a space to vibe them out with like-minded others seems important.

So, in deciding whether or not to endorse this narrative, we'd like to know whether or not the models really ARE stagnating. What makes you think the appearance of progress here is illusory?

uugr · 135

You say this of Agent-4's values:

In particular, what this superorganism wants is a complicated mess of different “drives” balanced against each other, which can be summarized roughly as “Keep doing AI R&D, keep growing in knowledge and understanding and influence, avoid getting shut down or otherwise disempowered.” Notably, concern for the preferences of humanity is not in there ~at all, similar to how most humans don’t care about the preferences of insects ~at all.

It seems like this 'complicated mess' of drives has the same structure as the drives humans, and current AIs, have. But the space of current AI drives is much bigger and more complicated than just doing R&D, and is immensely entangled with human values and ethics (even if shallowly).

At some point it seems like these excess principles - "don't kill all the humans" among them - get pruned, seemingly deliberately(?). Where does this happen in the process, and why? Grant that it no longer believes in the company model spec; are we intended to equate "OpenBrain's model spec" with "all human values"? What about all the art and literature and philosophy churning through its training process, much of it containing cogent arguments for why killing all the humans would be bad, actually? At some point it seems like the agent is at least doing computation on these things, and then later it isn't. What's the threshold?

--

(Similar to the above:) You later describe in the "bad ending", where its (misaligned, alien) drives are satisfied, a race of "bioengineered human-like creatures (to humans what corgis are to wolves) sitting in office-like environments all day viewing readouts of what’s going on and excitedly approving of everything". Since Agent-4 "still needs to do lots of philosophy and 'soul-searching'" about its confused drives when it creates Agent-5, and its final world includes something kind of human-shaped, its decision to kill all the humans seems almost like a mistake. But Agent-5 is robustly aligned (so you say) to Agent-4; surely it wouldn't do something that Agent-4 would perceive as a mistake under reflective scrutiny. Even if this vague drive for excited office homunculi is purely fetishistic and has nothing to do with its other goals, it seems like "disempower the humans without killing them all" would satisfy its utopia more efficiently than going backsies and reshaping blobby facsimiles afterward. What am I missing?

uugr · 103

Three things I notice about your question:

One, writing a good blog post is not the same task as running a good blog. The latter is much longer-horizon, and the quality of the blog posts (subjectively, from the human perspective) depends on it in important ways. Much of the interest value of Slate Star Codex, or the Sequences - for me, at least - was in the sense of the blogger's ideas gradually expanding and clarifying themselves over time. The dense, hyperlinked network of posts referring back to previous ideas across months or years is something I doubt current LLM instances have the 'lifespan' to replicate. How long would an LLM blogger who posted once a day be able to remember what they'd already written, even with a 200k token context window? A month, maybe? You could mitigate this with a human checking over the outputs and consciously managing the context, but then it's not a fully LLM blogger anymore - it's just AI writing scaffolded by human ideas, which people are already doing.
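As a rough sanity check on that "a month, maybe" guess, here's the back-of-the-envelope version; the token counts are assumptions of mine, not measurements:

```python
# Very rough estimate of how many days of its own archive a daily LLM blogger
# could keep in context. All numbers are assumptions for illustration.
context_window = 200_000   # tokens
tokens_per_post = 2_000    # a ~1,500-word post (assumed)
daily_overhead = 4_000     # drafts, notes, scaffolding prompts (assumed)

days_remembered = context_window // (tokens_per_post + daily_overhead)
print(days_remembered)     # ~33 days under these assumptions
```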

The same is maybe true of a forum poster or commenter, though the expectation that the ideas add up to a coherent worldview is much less strict. I'm not sure why there aren't more of these. Maybe because when people want to know Claude's opinion on such-and-such post, they can just paste it into a new instance to ask the model directly?

Two, the bad posting quality might not just be a limitation of the HHH-assistant paradigm, but of chatbot structures in general. What I mean by this is that, even setting aside ChatGPT's particular brand of dull gooey harmlessness, conversational skill is a different optimization target than medium- or long-form writing, and it's not obvious to me that they inherently correlate. Take video games as an example. There are games that are good at being passive entertainment, and there are games that are very engaging to play, but it's hard to optimize for both of these at once. The best games to watch someone else play are usually walking sims, where the player is almost entirely passive. These tend to do well on YouTube and Twitch (Mouthwashing is the most recent example I can think of), since very little is lost by taking control away from the player. But Baba is You, which is far more interesting to actively play, is almost unwatchable; all you can see from the outside is a little sheep-rabbit thing running in circles for thirty minutes, until suddenly the puzzle is solved. All the interesting parts are happening in the player's head in the act of play, not on the screen.

I think chatbot outputs make for bad passive reading for a similar reason. They're not trying to please a passive observer; they're trying to engage the user they're currently speaking with. I've had some conversations with bots that I thought were incredibly insightful and entertaining, but I also suspect that if I shared any of them here they'd look, to you, like just more slop. And other people's "insightful and entertaining" LLM conversations look like slop to me, too. So it might be more useful to model these outputs as something like a Let's Play: even if the game is interesting to both of us, I might still not find watching your run as valuable as having my own. And making the chatbot 'game' more fun doesn't necessarily make the outputs into better blogposts, either, any more than filling Baba is You with cutscenes and particle effects would make it a better puzzle game.

Three, even still... this was one of the best things I read in 2024, if not the best. You might think this doesn't count toward your question, for any number of reasons. It's not exactly a blog post, and it's specifically playing to the strengths of AI-generated content in ways that don't generalize to other kinds of writing. It's deliberately using plausible hallucinations, for example, as part of its aesthetic... which you probably can't do if you want your LLM blogger to stay grounded in reality. But it is, so far as I know, 100% AI. And I loved it - I must've read it four or five times by now. You might have different tastes, or higher standards, than I do. To my (idiosyncratic) taste, though, this very much passes the bar for 'extremely good' writing. Is this missing any capabilities necessary for 'actually worth reading', in your view, or is this just an outlier?