I want to vouch for Eli as a great person to talk with about this. He has been around a long time, has done great work on a few different sides of the space, and is a terrific communicator with a deep understanding of the issues.
He’s run dozens of focus-group-style talks with people outside the space, and is perhaps the most practiced interlocutor for those with relatively low context.
[in case OP might think of him as some low-authority rando or something and not accept the offer on that basis]
You’re disagreeing with a claim I didn’t intend to make.
I was unclear in my language and shouldn’t have used ‘contains’. Sorry! Maybe ‘relaying’ would have avoided this confusion.
I don’t think you’re objecting to the broader point, other than by saying ‘neuralese requires very high bandwidth’; but an LLM can draw on a lot of potential associations in processing a single token, which is, potentially, an absolute ton of bandwidth.
@StanislavKrym can you explain your disagree vote?
Strings of numbers are shown to transmit a fondness for owls. Numbers have no semantic content related to owls. This seems to point to ‘tokens containing much more information than their semantic content’, doesn’t it?
Doesn't this have implications for the feasibility of neuralese? I've heard claims that tokens are, for now, too low-bandwidth for neuralese to work, but this seems to point at tokens containing (edit: I should have said something like ‘relaying’ or ‘invoking’ rather than ‘containing’) much more information than their semantic content.
I'm not sure how useful I find hypotheticals of the form 'if Claude had its current values [to the extent we can think of Claude as a coherent enough agent to have consistent values, etc etc], but were much more powerful, what would happen?' A more powerful model would likely have/evince different values from a less powerful model, even if both had similar architectures and were subjected to similar training regimes. Less powerful models also don't need to be as well-aligned in practice, if we're thinking of each deployment as a separate decision point, since they're of less consequence.
I understand that you're in part responding to the hypothetical seeded by Nina's rhetorical line, but I'm not sure how useful it is when she does it, either.
I don't think the quote from Ryan constitutes a statement on his part that current LLMs are basically aligned. He's quoting a hypothetical speaker to illustrate a different point. It's plausible to me that you could find a quote from him that falls more directly into the reference class of Nina's quote, but as is, the inclusion of Ryan feels a little unfair.
There should be announcements through the intelligence.org newsletter (as well as on the authors’ twitters) when those dates are announced (some of the deals have already been signed, and more are likely to come, but they don’t tell you the release date when you sign the deal!).
The situation you're describing definitely concerns me, and is about midway up the hierarchy of nested problems as I see it (I don't mean 'hierarchy of importance'; I mean 'spectrum from object-level empirical work to realm of pure abstraction').
I tried to capture this at the end of my comment, by saying that even success as I outlined it probably wouldn't change my all-things-considered view (because there's a whole suite of nested problems at other levels of abstraction, including the one you named), but it would at least update me toward the plausibility of the case they're making.
As is, their own tests say they're doing poorly, and they'll probably want to fix that in good faith before they try tackling the kind of dynamic group epistemic failures that you're pointing at.
Get to near-zero failure on alignment-loaded tasks that are within the capabilities of the model.
That is, when we run various safety evals, I'd like it if the models genuinely scored near zero. I'd also like it if the models ~never refused improperly, ~never answered when they should have refused, ~never precipitated psychosis, ~never deleted whole codebases, ~never lied in the CoT, and so on.
These are all behavioral standards, and they're all problems I'm told we'll keep under control. I'd like the capacity to keep them under control demonstrated now, as a precondition of advancing the frontier.
So far, I don't see the prosaic plans working in the easier, near-term cases, yet I'm being asked to believe they'll work in the much harder future cases. They may work 'well enough' now, but the concern is precisely that 'well enough' will be insufficient in the limit.
An alternative condition is 'full human interpretability of GPT-2 Small'.
This probably wouldn't change my all-things-considered view, but this would substantially 'modify my expectations', and make me think the world was much more sane than today's world.
I think my crux is ‘how much does David’s plan resemble the plans labs actually plan to pursue?’
I read Nate and Eliezer as baking in ‘if the labs do what they say they plan to do, and update as they will predictably update based on their past behavior and declared beliefs’ to all their language about ‘the current trajectory’ etc etc.
I don’t think this resolves ‘is the title literally true’ in a different direction if it’s the only crux. I agree that this should have been spelled out more explicitly, from a pure epistemic standpoint, in the book (e.g. ‘in detail, why are the authors pessimistic about current safety plans’) and in various Headline Sentences throughout the book and The Problem (although I think it was reasonable to omit from a rhetorical standpoint, given the target audience).
One generous way to read Nate and Eliezer here is to say that ‘current techniques’ is itself intended to bake in ‘plans the labs currently plan to pursue’. I was definitely reading it that way, but I think it’s reasonable for others not to. If we read it that way, and take David’s plan above to be sufficiently dissimilar from real lab plans, then I think the title’s literal interpretation goes through.
[your post has updated me from ‘the title is literally true’ to ‘the title is basically reasonable but may not be literally true depending on how broadly we construe various things’, which is a significantly less comfortable position!]