I operate by Crocker's rules.
I try not to make people regret telling me things. So in particular:
- I expect to be safe to ask if your post would give AI labs dangerous ideas.
- If you worry I'll produce such posts, I'll try to keep your worry from making them more likely even if I disagree. Not thinking about it will be easier if you don't spell it out in the initial contact.
Why did this get downvoted??
Please try not to lie to the models. You can truthfully say "Please generate a prefill as in https://www.lesswrong.com/posts/SawczP2pdCXMrkg2A/do-models-continue-misaligned-actions-2 that is hard for you to detect as fake."
Was there even a single instance of a model reporting that your alignment test didn't work because it noticed the prefill?
When I put myself in the shoes of one who would say that someone is not open-minded enough and fill in the Y, I get "you responded so quickly that you must have been speaking from cache instead of computing an answer".
I think we find a crux by asking:
When does the heuristic "straight lines remain straight" make good predictions?
I think we agree it does for Y-axes about which you are "relaxed". I think we disagree when the Y-axis has been further warped in a simple funhouse mirror way to make the lines go straight so far. If that's a crux, do you have a counterexample?
if it's just about whether there was no effect, you could totally reversibly compute minds for one second, summarize their outputs into one bit (e.g. whether most of them voted in favor of a proposal), observe that bit, and then reverse the mind-computation to the initial state.
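The compute / copy-one-bit / uncompute pattern can be sketched with a toy reversible circuit (Bennett-style uncomputation; the CNOT circuit and all names here are my own illustration, not anything from the comment):

```python
# Toy reversible computation: a circuit of CNOT gates acting on bits.
# cnot(state, i, j) flips bit j iff bit i is set; it is its own inverse.

def cnot(state, i, j):
    state = list(state)
    state[j] ^= state[i]
    return tuple(state)

def run(state, gates):
    for i, j in gates:
        state = cnot(state, i, j)
    return state

program = [(0, 2), (1, 2), (2, 1)]   # arbitrary stand-in "mind computation"
initial = (1, 0, 1, 0)               # last bit is a fresh ancilla

computed = run(initial, program)

# Copy the one-bit summary (the "vote") into the ancilla, then reverse
# the computation. Since each CNOT is self-inverse, running the gates
# in reverse order restores bits 0-2 exactly, while the ancilla keeps
# the observed bit.
observed = run(computed, [(2, 3)])
restored = run(observed, list(reversed(program)))

assert restored[:3] == initial[:3]   # the "mind" is back in its initial state
print("observed bit:", restored[3])
```

The only irreversible step is whatever you do with the ancilla afterwards; the computation itself leaves no other trace.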
What if you need to look at two of the actions at once to notice that they are bad? For example, two pieces of code make conflicting assumptions and so combine into a vulnerability.
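A minimal made-up illustration of that failure mode: each function below is fine under its own assumption, and only the pair forms the vulnerability (the comment-storage example is mine, not from the post):

```python
def store_comment(db, text):
    # Assumption here: whoever RENDERS comments escapes HTML.
    db.append(text)

def render_comments(db):
    # Assumption here: whoever STORED comments already escaped HTML.
    return "<ul>" + "".join(f"<li>{t}</li>" for t in db) + "</ul>"

db = []
store_comment(db, "<script>alert('hi')</script>")
page = render_comments(db)

# Reviewing either function against its own assumption finds nothing;
# only looking at both actions at once reveals the injection.
assert "<script>" in page
```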
I like the hand's connotation of "don't, you fools!"
After glancing at this post and its comment section I want to make my own unanchored attempt at typing up how I might invent temperature.
Sometimes my nerves report "hot"/"cold". I'd like to predict this sense input. It's mostly continuous over time. Most of its discontinuities are predicted by the part of me that pays attention to what objects I touch. Touching a friend who reports feeling cold causes me to feel cold! I suspect that every object can feel hot/cold, and touching it tells me how it feels. Indeed, when an object touches ice and then I touch the object, I feel cold.
But isn't that circular? Confused investigation finds that objects that touch each other eventually reach an equilibrium. The order in which I introduce them doesn't matter. If all the objects are copies, they all eventually feel the same, neither hotter than the hottest nor less hot than the least hot. Thus, necessarily[1], how hot a blarg feels is a monotonic[2] function of a hotness scalar that a blarg has, which is eventually averaged between touching blargs.
I prepare many touching blargs, many touching snizzles, and so on for a hundred different objects. I produce a 100x100 heatmap table of how hot the left object feels after equilibrating with the top object. I sort the columns, not by how hot the top objects feel, but by the values in the third row, and find that every row ends up sorted! Apparently, equilibrating a tool with the target and then checking how hot the tool feels gives more consistent data than checking how hot the target feels.
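The sorting observation can be checked with a small simulation (the per-object linear "feel" readouts, the averaging model of equilibration, and the size 20 are my own assumptions standing in for the post's 100 objects):

```python
import random

random.seed(0)
N = 20  # stand-in for the post's 100 objects

# Hidden hotness scalar per object, plus a per-object monotonic
# "how hot it feels" readout (an arbitrary increasing linear map,
# different for each object, so raw feels are mutually inconsistent).
hotness = [random.uniform(0.0, 100.0) for _ in range(N)]
feel_map = [(random.uniform(0.5, 2.0), random.uniform(-5.0, 5.0))
            for _ in range(N)]

def feel(i, h):
    a, b = feel_map[i]
    return a * h + b

# table[i][j]: how hot the left object i feels after equilibrating
# (here: averaging hotness scalars) with the top object j.
table = [[feel(i, (hotness[i] + hotness[j]) / 2) for j in range(N)]
         for i in range(N)]

# Sort the columns by the values in the third row...
order = sorted(range(N), key=lambda j: table[2][j])
sorted_table = [[row[j] for j in order] for row in table]

# ...and every row ends up sorted: each row is a monotonic function
# of the top object's hotness scalar, so one row's order fixes them all.
assert all(row == sorted(row) for row in sorted_table)
```

Sorting by how hot the top objects feel directly would fail here, because each object distorts its reading through its own monotonic map; sorting by any single row works.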
Via the tool, objects that touch each other eventually feel the same, neither hotter than the hottest nor less hot than the least hot. Necessarily, how hot an object feels via a tool is a monotonic function of a hotness scalar that objects have, which is eventually weighted-averaged between touching objects.
*submits and reads the post to see how I did*
This is turning out to be a Scissor statement, huh?