Thanks for the feedback! I was quite surprised at the Claude results myself. I did play around a little bit with the prompt on Claude 3.5 Sonnet, and found that it could change the result on individual questions, but I couldn't get it to change the overall accuracy much that way -- other questions would also flip to refusal. So this certainly warrants further investigation, but by itself I wouldn't take it as evidence that the overall result changes.
In fact, a friend of mine got Claude to answer questions quite consistently, and could only replicate the freque...
Sure, perhaps another example from Claude 3 Opus illustrates the point better: