I'd expect that giving much more detailed instructions, and perhaps an example of an excellent summary, would substantially improve Claude's performance.
https://docs.anthropic.com/claude/docs/constructing-a-prompt has more detailed and specific advice if you're interested in continuing to work on this.
In discussions, Ngo argues that less general, less capable AI could still be useful and less risky. He suggests “solving scientific problems” or “theoretically proving mathematical theorems” as alternative pivotal acts an AI could perform.
Maybe that's a tad too literal? I'd laugh out loud if Claude changed the examples to "summarize debate among rationalists " as alternative parodical acts an AI might arguably perform in some distant future post 2021.
Typo, should be "alternative pivotal acts".
Hmm I can see the humor but I cannot feel it, I've explained why this wouldn't be especially surprising if you take conversation seriously and a part of me aches that people still don't.
Typo
Nope. It’s a failure if I need to explain the irony, but the actual summary proves one of the point Claude summarizes, while my own parody would support your side of the argument (as it would suggest Claude could chose to have some fun rather than execute the instructions as ordered to). In other words, one of the feature was meta-laughing at myself for finding this scenario funny.
if you take conversation seriously and a part of me aches that people still don't.
You’re right, I still don’t understand why many interesting minds (including clear genius like Scott Alexander) would take many of LW classics on x-risks seriously. Sorry my opinion got you angry.
Dialogs are crucial for identifying errors, developing syntheses, and surfacing cruxes and critical uncertainties. They're also much easier to produce than structured writing or presentations, so we would like to think that they're also a good way of getting important ideas articulated and written up, but they're kind of not, because real dialogs meander and stumble, they take place at the limits of our ability to make ourselves understood, they're the meeting of views that don't yet know their synthesis. Usually, mutual understanding arrives only towards the end, after many failed attempts. I've also found that there seems to be a tradeoff between interpersonal or professional functionality and legibility to outside observers: the more the participants are speaking to each other, the less the participants are thinking about an audience, denser, more personal, more idiosyncratic. Dialog narrows in on the most important issues in a field but then only produces long-winded, hostile documents.
Claude is a chat assistant from Anthropic. Its content window fits like 100,000 tokens. We may be able to use it to rapidly, cheaply boil dialogs down into readable texts. If this works, this could be genuinely transformative: There would no longer need to be this frustrating gulf between the field's understanding of its own state of the art, and the field's outward-facing materials and documentation.
To test this, I decided try it on a dialog that many of us will have read, Ngo and Yudkowsky on Alignment Difficulty.
Well, here's my prompt:
I don't know whether I'm good at prompts, tell me whether this makes sense. The delimiters are there to discourage Claude from responding to instructions in the conversation itself, or attempting to continue the conversation, like, I'm hoping
[TRANSCRIPT QUOTE END]
sorta jolts it all the way back to remembering that it was asked a question and all it's supposed to do is answer that.(Edit: I did some experimentation and yeah my intuition was right, omitting the delimiters seems to lead it to focus its attention poorly. If you remove them via "edit chat", Claude completely breaks. The usual delimiters (probably special tokens) aren't there in that mode, the plain text "Assistant: " is not enough to wake Claude up from its dream. It completes the "Assistant: " line as a stereotype of a chat assistant (a completely misplaced "I apologize, I should not have included hypothetical responses from you without your consent. Here is an edited version:"), and then continues the ngo-yudkowsky dialog.
I'm not sure why this implementation of Edit Chat should be expected to work. Isn't it actually a security issue if Claude's assistant behavior can be activated without special
Assistant
tokens? Shouldn't it be be forbidden for an external user to tell Claude that it has been saying things that it hasn't been saying, isn't that always going to lead to severe jailbreaks given that Claude's chat backlog is also its working memory? It shouldn't be able to work.)I feel that a lot here rests on what you put in the place of "most important insights".
Critically, Claude might not be able to know what's "important". LLMs are not generally in touch with the zeitgeist. To make this work well, the summarizer will have to know something about the background knowledge of the people it's extracting the summary for, and what would be fresh to them.
Claude's interpretation, after about 20 seconds:
What do we think of it?
I gave this a thumbs up, but gave feedback suggesting that it reflect on its botched retelling of the laser focusing analogy (other attempts have done a better job with that, and it's cool that they're all picking up on the self-focusing laser analogy being a critical point, it wasn't the part of the conversation that stood out in my memory of the dialog, but in retrospect, it's very good). Its summary also isn't as long as it should be.
If Claude can't do this, I'd suggest developing LLMs specialized for it. Compressing and refining dialog is probably the central mechanism of human collective intelligence. If we can accelerate that, we will have improved human collective intelligence (or at least within the alignment community) and I think that would be good differential progress.