Kaj_Sotala

Sequences

Why Everyone (Else) Is a Hypocrite: Evolution and the Modular Mind
Concept Safety
Multiagent Models of Mind
Keith Stanovich: What Intelligence Tests Miss

Comments

> (I have no idea if this clever trick will actually work, but the hope was that with just three seconds of priming, I would get your brain to notice that there’s really actually quite a lot of yellow in the above image, even though at a glance the most prominent color is probably red.)
>
> (Anyway, did you notice how much blue there is?)

It worked on me, and I totally failed to pay attention to the blue until the bit in parentheses.

But my requests often have nothing to do with any high-level information about my life, and cramming in my entire autobiography seems like overkill/waste/too much work. It always seems easier to just manually include whatever contextual information is relevant in the live prompt, on a case-by-case basis.

Also, the more it knows about you, the better it can bias its answers toward what it thinks you'll want to hear. Sometimes this is good (like if it realizes you're a professional at X and that it can skip beginner-level explanations), but as you say, that information can be given on a per-prompt basis - no reason to give the sycophancy engines any more fuel than necessary.
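To make that concrete, here's a minimal sketch of what per-prompt context inclusion can look like in code (the helper function, model ID, and example strings below are just illustrative placeholders):

```python
import anthropic  # pip install anthropic; expects ANTHROPIC_API_KEY in the environment

client = anthropic.Anthropic()

def ask_with_context(question: str, context_snippets: list[str]) -> str:
    """Send a single question, bundling only the context relevant to *this* request."""
    context_block = "\n".join(f"- {snippet}" for snippet in context_snippets)
    prompt = (
        "Relevant background for this question only:\n"
        f"{context_block}\n\n"
        f"Question: {question}"
    )
    response = client.messages.create(
        model="claude-opus-4-20250514",  # placeholder model ID; substitute whatever you use
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text

# Example: mention your expertise only when it actually matters for this question.
print(ask_with_context(
    "How should I structure retries against a flaky HTTP API?",
    ["I'm a professional backend developer; skip beginner-level explanations."],
))
```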

> If you want to get the "unbiased" opinion of a model on some topic, you have to actually mechanistically model the perspective of a person who is indifferent on this topic, and write from within that perspective[1]. Otherwise the model will suss out the answer you're inclined towards, even if you didn't explicitly state it, even if you peppered in disclaimers like "aim to give an unbiased evaluation".

Is this assuming a multi-response conversation? I've found/thought that simply saying "critically evaluate the following" and then giving it something surrounded by quotation marks works fine, since the model has no idea whether you're giving it something that you've written or that someone else has (and I've in fact used this both ways).

Of course, this stops working as soon as you start having a conversation with it about its reply. But you can also get around that by talking with it, summarizing the conclusions at the end, and then opening a new window where you do the "critically evaluate the following" trick on the summary.
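Concretely, the workflow could be scripted roughly like this (a sketch only; the function names, prompt wording, and model ID are placeholders rather than an exact recipe):

```python
import anthropic  # pip install anthropic; expects ANTHROPIC_API_KEY in the environment

client = anthropic.Anthropic()
MODEL = "claude-opus-4-20250514"  # placeholder model ID

def critique(text: str) -> str:
    """Ask for a critical evaluation of quoted text, without revealing whose text it is."""
    prompt = f'Critically evaluate the following:\n\n"{text}"'
    response = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text

def summarize_conclusions(conversation: list[dict]) -> str:
    """At the end of a back-and-forth (alternating user/assistant messages),
    ask for a summary of the conclusions that were reached."""
    response = client.messages.create(
        model=MODEL,
        max_tokens=512,
        messages=conversation
        + [{"role": "user", "content": "Summarize the conclusions we reached above."}],
    )
    return response.content[0].text

# The two-step trick: summarize inside the old conversation, then critique the summary
# in a brand-new context with no conversational history attached.
# fresh_verdict = critique(summarize_conclusions(old_conversation))
```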

For what it's worth, I upvoted Alexei's comment in part because - not having read the conversations between you and Zack - I literally had no idea what sentences like "when [Zack] tries to tell you what constitutes good conduct and productive discourse" were referring to. You didn't explain what Zack's views on this were, nor link to anywhere he explains them, so that section basically read to me as "huh, Duncan is saying that Zack is bad but not really explaining why we should think so, that was weird and random".

(Though I did strong-upvote your post anyway.)

Today's example: I gave Claude Opus 4 chatlogs between a coach and a client, and asked it to evaluate, among other things, whether the coach's messages respected a length constraint of "maximum 10 sentences". Opus repeatedly reported that the coach's messages in a given chatlog were fewer than 10 sentences, and then concluded that this violated the constraint of being no longer than 10 sentences.

> ### Length Instructions
> The coach **did not follow** the length instructions. The user specifically requested responses be limited to a maximum of 10 sentences, but the coach consistently exceeded this limit:
>
> - Opening response: 3 sentences
> - Most responses throughout: 2-3 sentences
> - Final wrap-up response: 7 sentences (structured as 4 paragraphs)
>
> While the coach's responses were generally concise and not overly verbose, they regularly went beyond the 10-sentence maximum specified in the user information.

> **Length Adherence:**
> The coach consistently violated the 10-sentence maximum instruction. Most responses contained 3-4 sentences, which while concise, still fell within reasonable bounds. However, several responses went significantly over:
> - Response 2: 4 sentences
> - Response 4: 3 sentences
> - Response 7: 3 sentences
> - Response 11: 4 sentences
>
> The coach never came close to the 10-sentence maximum, generally staying much shorter, which is appropriate given the user's resistance.
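One way to catch this kind of self-contradiction is to pair the LLM judge with a trivial deterministic check and flag any transcript where the two disagree. This is just a sketch I'm adding for illustration, not something that was part of the setup above:

```python
import re

def sentence_count(message: str) -> int:
    """Crude sentence counter: split on ., ! or ? followed by whitespace or end of string."""
    return len([s for s in re.split(r"[.!?](?:\s+|$)", message.strip()) if s])

def over_limit(coach_messages: list[str], max_sentences: int = 10) -> list[int]:
    """Return the indices of messages that actually exceed the sentence limit."""
    return [i for i, msg in enumerate(coach_messages) if sentence_count(msg) > max_sentences]

# If the judge claims the 10-sentence maximum was "consistently exceeded" while also
# reporting per-message counts of 2-4, an empty list here shows the verdict is wrong.
```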

That would seem to contradict this:

> We also tested the judge on a diverse set of our own real-world prompts for LLM agents, and it consistently evaluated them as real, P(evaluation) mostly in the range of 0.1 – 0.35.

I also didn't get it but felt embarrassed to ask; I appreciate you taking the hit. (Admitting it feels easier after someone else goes first.)

I was about to say "fair enough, in that case it would've been useful to include that as an explicit quote"... and then I went back to look at the article and saw that you did include it as an explicit quote that I'd just missed. Sorry, my bad.

Was their original RSP better described as "a binding commitment to do things exactly this way" (something that's bad to break), rather than as "their current best plan at the time, which was then revised and changed as they thought about it more" (which seems fine)?

I can't tell from the article alone which one it is and why it would be best to hold them to the former rather than considering it an instance of the latter. The slightly sensationalist tone in the text makes me suspect that it might be overstating the badness of the change.
