Not being able to figure out what sort of thing humans would rate highly isn't an alignment failure, it's a capabilities failure, and Eliezer_2008 would never have assumed a capabilities failure in the way you're saying he would. He is right to say that attempting to directly encode the category boundaries won't work. It isn't covered in this blog post, but his main proposal for alignment was always that as far as possible, you want the AI to do the work of using its capabilities to figure out what it means to optimize for human values rather than trying to directly encode those values, precisely so that capabilities can help with alignment. The trouble is that even pointing at this category is difficult - more difficult than pointing at "gets high ratings".

Magical Categories

Paul Crowley4mo114

I'm not quite seeing how this negates my point, help me out?

Eliezer sometimes spoke of AIs as if they had "reward channel"
But they don't, instead they are something a bit like "adaption executors, not fitness maximizers"
This is potentially an interesting misprediction!
Eliezer also said that if you give the AI the goal of maximizing smiley faces, it will make tiny molecular ones
TurnTrout points out that if you ask an LLM if that would be a good thing to do, it says no
My point is that this is exactly what Eliezer would have predicted for an LLM whose reward channel was "maximize reader scores"
Our LLMs tend to produce high reader scores for a reason that's not exactly "they're trying to maximize their reward channel"
I don't at all see how this difference makes a difference! Eliezer would always have predicted that an AI aimed at maximizing reader scores would have produced a response to TurnTrout's question that maximized reader scores, so it's silly to present them doing so as a gotcha!

Magical Categories

Paul Crowley4mo117

In this instance the problem the AI is optimizing for isn't "maximize smiley faces", it's "produce outputs that human raters give high scores to". And it's done well on that metric, given that the LLM isn't powerful enough to subvert the reward channel.

Using axis lines for good or evil

Paul Crowley1y86

I'm sad that the post doesn't go on to say how to get matplotlib to do the right thing in each case!

Nathan Helm-Burger's Shortform

Paul Crowley1y30

I thought you wanted to sign physical things with this? How will you hash them? Otherwise, how is this different from a standard digital signature?

Nathan Helm-Burger's Shortform

Paul Crowley1y90

The difficult thing is tying the signature to the thing signed. Even if they are single-use, unless the relying party sees everything you ever sign immediately, such a signature can be transferred to something you didn't sign from something you signed that the relying party didn't see.

Effective Aspersions: How the Nonlinear Investigation Went Wrong

Paul Crowley1y1813

Of course this market is "Conditioning on Nonlinear bringing a lawsuit, how likely are they to win?" which is a different question.

Paul Crowley's Shortform

Paul Crowley2y70

Extracted from a Facebook comment:

I don't think the experts are expert on this question at all. Eliezer's train of thought essentially started with "Supposing you had a really effective AI, what would follow from that?" His thinking wasn't at all predicated on any particular way you might build a really effective AI, and knowing a lot about how to build AI isn't expertise on what the results are when it's as effective as Eliezer posits. It's like thinking you shouldn't have an opinion on whether there will be a nuclear conflict over Kashmir unless you're a nuclear physicist.

Campaign for AI Safety: Please join me

Paul Crowley2y20

Thanks, that's useful. Sad to see no Eliezer, no Nate or anyone from MIRI or having a similar perspective though :(