From London, now living in the Santa Cruz mountains.
Not being able to figure out what sort of thing humans would rate highly isn't an alignment failure; it's a capabilities failure, and Eliezer_2008 would never have assumed a capabilities failure in the way you're saying he would. He is right to say that attempting to directly encode the category boundaries won't work. It isn't covered in this blog post, but his main proposal for alignment was always that, as far as possible, you want the AI to do the work of using its capabilities to figure out what it means to optimize for human values, rather than trying to directly encode those values, precisely so that capabilities can help with alignment. The trouble is that even pointing at this category is difficult - more difficult than pointing at "gets high ratings".
I'm not quite seeing how this negates my point; help me out?
In this instance the problem the AI is optimizing for isn't "maximize smiley faces"; it's "produce outputs that human raters give high scores to". And it's done well on that metric, given that the LLM isn't powerful enough to subvert the reward channel.
I'm sad that the post doesn't go on to say how to get matplotlib to do the right thing in each case!
I thought you wanted to sign physical things with this? How will you hash them? Otherwise, how is this different from a standard digital signature?
The difficult thing is tying the signature to the thing signed. Even if signatures are single-use, unless the relying party sees everything you ever sign as soon as you sign it, a signature can be lifted from something you signed that the relying party didn't see and attached to something you didn't sign.
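For concreteness, here's a minimal sketch of what "tying the signature to the thing signed" looks like in an ordinary digital signature. The library choice (the Python `cryptography` package with Ed25519) and the example messages are my assumptions, not anything from the original thread; the point is just that a standard signature commits to the exact bytes signed, so it can't be transferred to different content the way a content-free "single-use" signature could.

```python
# Assumed illustration: a standard digital signature is bound to the exact
# bytes signed, so it cannot be moved onto a different message.
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

private_key = Ed25519PrivateKey.generate()
public_key = private_key.public_key()

signed_message = b"a document the relying party never saw"
signature = private_key.sign(signed_message)  # the scheme hashes the message internally

# Verifying against the original bytes succeeds (no exception raised).
public_key.verify(signature, signed_message)

# Trying to attach the same signature to different content fails, because
# the signature commits to the content, not just to the signer's identity.
try:
    public_key.verify(signature, b"something you did not sign")
except InvalidSignature:
    print("signature does not transfer to other content")
```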
Of course, this market is "Conditioning on Nonlinear bringing a lawsuit, how likely are they to win?", which is a different question.
Extracted from a Facebook comment:
I don't think the experts are expert on this question at all. Eliezer's train of thought essentially started with "Supposing you had a really effective AI, what would follow from that?" His thinking wasn't at all predicated on any particular way you might build a really effective AI, and knowing a lot about how to build AI isn't expertise on what the results are when it's as effective as Eliezer posits. It's like thinking you shouldn't have an opinion on whether there will be a nuclear conflict over Kashmir unless you're a nuclear physicist.
Thanks, that's useful. Sad to see no Eliezer, no Nate, nor anyone else from MIRI or with a similar perspective, though :(
Also Rosie Campbell https://x.com/RosieCampbell/status/1863017727063113803