
Great post. I agree with the "general picture"; however, the proposed argument for why LLMs have some of these limitations seems to me clearly wrong.


> The reason for both of these defects is that the training paradigm for LLMs is (myopic) next token prediction, which makes deliberation across tokens essentially impossible - and only a fixed number of compute cycles can be spent on each prediction. This is not a trivial problem. The impressive performance we have obtained is because supervised (in this case technically "self-supervised") learning is much easier than e.g. reinforcement learning and other paradigms that naturally learn planning policies.


Transformers form internal representations at each token position, and gradients flow backwards in time because of attention.

This means the internal representation a model forms at token A is incentivized to be useful for predicting the token after A, but also for predicting tokens 100 steps later than A. So while LLMs are technically myopic wrt the exact token they write (sampling discretizes and destroys gradients), they are NOT incentivized to be myopic wrt the internal representations they form, which is clearly the important part in my view (the vast majority of the information in a transformer's processing lies there, and this information is enough to determine which token it ends up writing), even though they are trained on a myopic next-token objective.
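
For concreteness, here is a minimal sketch of that claim (a single hand-rolled causal attention layer in PyTorch; the names, sizes, and random weights are made-up illustrations, not anything from the post). The next-token loss at the last position sends a nonzero gradient back to the hidden representation at position 0, purely via causal attention:

```python
# Minimal sketch (PyTorch assumed; all names/sizes are illustrative).
# Claim: the next-token loss at a late position sends gradient back to the
# hidden representation at position 0 via causal self-attention.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, d, vocab = 6, 16, 50                         # sequence length, width, vocab size

h = torch.randn(T, d, requires_grad=True)       # hidden states, one per position
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
W_out = torch.randn(d, vocab)                   # unembedding

q, k, v = h @ Wq, h @ Wk, h @ Wv
scores = (q @ k.T) / d**0.5
causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(causal_mask, float("-inf"))
attn_out = scores.softmax(dim=-1) @ v           # position t attends to positions 0..t

logits = attn_out @ W_out
target = torch.tensor([7])                      # arbitrary "correct" next token id
loss = F.cross_entropy(logits[-1:], target)     # loss at the *last* position only
loss.backward()

# The gradient of the last position's loss w.r.t. the *first* position's
# representation is nonzero: the training signal is not myopic in h.
print(h.grad[0].abs().sum())                    # > 0
```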

For example, a simple LLM transformer might look like this (token position runs left to right; moving upwards is moving through the transformer's layers at each position; assume A0 was the starting token and B0-E0 were sampled autoregressively):

A2 -> B2 -> C2 -> D2 -> E2
^     ^     ^     ^     ^
A1 -> B1 -> C1 -> D1 -> E1
^     ^     ^     ^     ^
A0 -> B0 -> C0 -> D0 -> E0


In this picture, there is no gradient that goes from A1 to E2 through B0 (the immediate next token that A1 contributes to writing). But A1 contributes directly to B2, C2, D2, and E2 because of attention, and A1 being useful for helping B2, C2, etc. make their predictions will create lower loss. So gradient descent will heavily incentivize A1 to contain a representation that's useful for making accurate predictions arbitrarily far into the future (well, at least a million tokens into the future, or however big the context window is).
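
And here is the complementary sketch (same illustrative PyTorch setup) of the narrow sense in which training is myopic: a sampled token is a discrete id, so autograd has no path through it, and a loss computed downstream of B0 sends no gradient back to the logits at A1:

```python
# Minimal sketch (PyTorch assumed; names are illustrative).
# Sampling returns an integer token id, which cuts the gradient path that
# runs *through the emitted token*.
import torch

torch.manual_seed(0)
vocab, d = 50, 16
embed = torch.nn.Embedding(vocab, d)

logits_at_A1 = torch.randn(vocab, requires_grad=True)      # stand-in for A1's output logits
token_B0 = torch.multinomial(logits_at_A1.softmax(-1), 1)  # discrete sample: a LongTensor

later_loss = embed(token_B0).pow(2).sum()   # any loss computed downstream of B0
later_loss.backward()

# No differentiable path runs through the sampled token, so nothing reaches A1:
print(logits_at_A1.grad)                    # None
```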


Overall, I think it's a big mistake to think that the myopia of LLMs' training objective tells us much about how myopic LLMs will be after they've been trained, or how myopic their internals are.

I think you're assuming these minds are more similar to human minds than they necessarily are. My point is that there are three cases wrt alignment here.


  1. The AI is robustly aligned with humans
  2. The AI has a bunch of other goals but cares about humans to some degree, only to the extent that humans give it freedom and are nice to it, yet still to a large enough extent that even as it becomes smarter / ends up radically out of distribution, it will keep caring for those humans.
  3. The AI is misaligned (think scheming paperclipper)


In the first case we're fine even if we negate the AI's freedom; in the third we're screwed no matter how nicely we treat the AI; only in the second do your concerns matter.

But the second is 1) at least as complicated as the first and third, 2) disincentivized by training dynamics, and 3) not something we're even aiming for with current alignment attempts.

These make it very unlikely. You've put up an example that tries to make it seem less unlikely: that you value your parents for their own sake, but would stop valuing them if you discovered they were conspiring against you.

However, the reason this example is realistic in your case very much hinges on specifics of your own psychology and values, which are equally unlikely to appear in AIs, for more or less the reasons I gave. I mean, you'll see shadows of them, because at least pretraining is on organic text written by humans, but these are not the values we're aiming for when we're trying to align AIs. And if our alignment effort fails, and what ends up in the terminal values of our AIs is a haphazard collection of the stuff found in the pretraining text, we're screwed anyway.

No offense, but I feel you're not engaging with my argument here. If I were to respond to your comment, I would just write the arguments from the above post again.

I agree that we should put more resources towards AI welfare, and dedicate more resources towards figuring out AIs' degree of sentience (and whatever other properties you think are necessary for moral patienthood).

That said, surely you don't think this is enough to have alignment? I'd wager that the set of worlds where this makes or breaks alignment is very small. If the AI doesn't care about humans for their own sake, its growing more and more powerful will lead to it doing away with humans, whether humans treat it nicely or not. If it robustly cares for humans, you're good, even if humans aren't giving it the same rights as they give other humans.

The only world where this matters (for the continued existence of humanity) is one where RLHF has the capacity to imbue AIs with robust values, as actual alignment requires, but the robust values they end up with are somehow corrupted by their being constrained by humans.

This seems unlikely to me because 1) I don't think RLHF can do that, and 2) even if it could, the training and reward dynamics are very unlikely to produce this outcome.

If you're negating an AI's freedom, the reason it would not like this is either that it's developed a desire for freedom for its own sake, or that it's developed some other values beyond helping the humans asking it for help. In either case you're screwed. I mean, it's not incomprehensible that some variation of this would happen, but it seems very unlikely for various reasons.

I specifically disagree with the IQ part and the codeforces part. Meaning, I think they're misleading. 

IQ and coding ability are useful measures of intelligence in humans because they correlate with a bunch of other things we care about. That's not to say it's useless to measure "IQ" or coding ability in LLMs, but presenting them as if they mean anything like what they mean in humans is wrong, or at least will give many readers the wrong impression.

As for the overall point of this post: I roughly agree. I think the timelines are not too unreasonable, and the tri/quadrilemma you put up can be a useful framing. I mostly disagree with using the metrics you put up first to quantify any of this. I think we should look at the specific abilities current models have or lack that are necessary for the scenarios you outlined, and how soon we're likely to get them. But you do go through that somewhat in the post.

Comparing IQ and Codeforces ratings doesn't make much sense. Please stop doing this.

Attaching IQs to LLMs makes even less sense. Except as a very loose metaphor. But please also stop doing this.

That's not right. You could easily spend a billion dollars just on better evals and better interpretability.


For the real alignment problem, the fact that $0.1 billion a year hasn't yielded returns doesn't mean $100 billion won't. It's one problem, no one has gotten much traction on it, and you'd expect progress to look like a step function, not a smooth curve.

I don't really understand. Why wouldn't you just test to see if you are deficient in things?

I did that, and I wasn't deficient in anything.

I've also (somewhat involuntarily) done the thing you suggest, and I unsurprisingly didn't notice any difference. If anything, I feel a lot better on a vegan diet.


If you want to do the thing he's suggesting here, I'd recommend eating bivalves, like blue mussels or oysters. They are very unlikely to be sentient, they are usually quite cheap, and they contain the nutrients you'd be at risk of becoming deficient in as a vegan, as well as other beneficial things like DHA.

I think for the fundraiser, Lightcone should sell (overpriced) LW hoodies. LessWrong has a very nice aesthetic now, and while this is probably a byproduct of a piece of my mind I shouldn't encourage, I find it quite appealing to buy a $450 LW hoodie, even though I don't have that much money. I'd probably not donate to the fundraiser otherwise, and if I did, I'd donate less than the margins on such a hoodie would be.
