but whose actual predictive validity is very questionable.

and whose predictive validity in humans doesn't transfer well across cognitive architectures, e.g. reverse digit span.

Claude can also invoke instances of itself using the analysis tool (tell it to look for self.claude).

Yeah "try it and see" is the gold standard. I do know that for stuff which boils down to "monitor for patterns in text data that is too large to plausibly be examined by a team of humans we could afford to hire" I've been favoring the approach of

  1. Grab 100 random data points, run the most obvious possible prompt on them to get reasoning + label(s) + confidence (see the sketch after this list)
  2. Spot check the high confidence ones to make sure you're not getting confident BS out of the model (you can alternatively start by writing two very different prompts for the same labeling task and see where the answers differ, that will also work)
  3. Look at the low-confidence ones, see if the issue is your labeling scheme / unclear prompt / whatever - usually it's pretty obvious where the model is getting confused
  4. Tweak your prompt, comparing new labels to old. Examine any data points whose label changed - often your prompt change fixed the original problems but caused new ones to surface. Note that for this step you want your iteration time to be under 10 minutes, and ideally under 10 seconds from "hit enter" to "results show up on screen". Any of the major LLMs can trivially vibe code you an acceptable spreadsheet-like interface for this, including hooking up the tool calling API to get structured data out of your prompts for easy inspection.
  5. Once you're reasonably happy with the performance on 100 samples, bump to 1000, run all 1000 datapoints against all of the prompts you've iterated on so far, and focus on the datapoints which got inconsistent results between prompts or had a low-confidence answer on the last one
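
If it helps, here's a minimal sketch of what steps 1, 3, and 5 can look like in code. Everything concrete here is an assumption about your setup: `call_llm` is a stand-in for whatever LLM client you actually use, the data points are plain strings, and the prompt and label set are made-up examples.

```python
import json
import random

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in: takes a prompt string, returns the model's text reply."""
    raise NotImplementedError("wire this up to your preferred LLM API")

# Made-up labeling task for illustration only.
LABEL_PROMPT = """You are labeling support tickets as 'bug', 'feature_request', or 'other'.
Reply with JSON: {{"reasoning": "...", "label": "...", "confidence": 0.0}}

Ticket:
{datapoint}"""

def label_sample(datapoints, prompt_template, n=100, seed=0):
    """Step 1: run the prompt on a random sample, collecting reasoning/label/confidence."""
    random.seed(seed)
    sample = random.sample(datapoints, min(n, len(datapoints)))
    results = []
    for dp in sample:
        raw = call_llm(prompt_template.format(datapoint=dp))
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            parsed = {"reasoning": raw, "label": "PARSE_ERROR", "confidence": 0.0}
        results.append({"datapoint": dp, **parsed})
    return results

def review_queue(results):
    """Step 3: surface the items worth human attention, lowest confidence first."""
    return sorted(results, key=lambda r: r["confidence"])

def diff_prompts(results_a, results_b):
    """Step 5: datapoints where two prompt versions disagree - look at these first."""
    by_dp = {r["datapoint"]: r for r in results_b}
    return [
        (ra, by_dp[ra["datapoint"]])
        for ra in results_a
        if ra["datapoint"] in by_dp and ra["label"] != by_dp[ra["datapoint"]]["label"]
    ]
```

The spreadsheet-like interface from step 4 is just a view over these `results` dicts; the disagreement diff plus the low-confidence sort is what keeps your attention on the datapoints that actually need it.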

Once I'm happy with the performance on a sample of 1000 I rarely encounter major issues with the prompt, other than ones I was already aware of and couldn't be bothered to fix (the usual case being "I realize that the data I'm asking the model to label doesn't contain all decision-relevant information, and that when I'm labeling I sometimes have to fetch extra data, and I don't want to build that infrastructure right now, so I'll call it 'good enough' and ship it, or 'not good enough' and abandon it").

TBH I think that most of the reason this method works for me is that it's very effective at shoving the edge cases to me early while not wasting a bunch of my attention on the stuff that is always easy.

Once you know what you're looking for, you can read the published research all day about whether fine-tuning or ICL or your favorite flavor of policy optimization is best, but in my experience most of the alpha comes from making sure I'm asking the right question in the first place, and once I am asking the right question, performance is quite good no matter what approach I'm taking.

They are sometimes able to make acceptable PRs, usually when generating the PR doesn't require iteratively gathering context to build up a model of the relevant code.

I wonder how hard it would be to iteratively write a prompt that can consistently mimic your judgment in distinguishing between good takes and bad takes for the purposes of curating a well-chosen tuning dataset. I'd expect not that hard.

> while agreeing on everything that humans can check

Can you provide an example of a place where two AIs would want to make conflicting claims about something while agreeing with everything that humans could check, even in principle? Presumably, if the two AI agents care about which claim the human believes, that is because there is some expected difference in outcomes depending on which one the human believes. If the two agents' predictions are identical at the present time T0, but their predictions about a specific future time T1 are meaningfully different, then either their predictions agree at T0.5 (in which case you can binary search between T0.5 and T1 for the specific places where they disagree) or they differ at T0.5 (in which case you can do the same between T0 and T0.5).
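
As a toy illustration of that binary-search step (not anything the agents would actually expose), suppose each agent offers a hypothetical `predict(t)` function that returns a discrete, comparable prediction about the world at time `t`:

```python
def earliest_divergence(predict_a, predict_b, t0, t1, tol=1e-6):
    """Binary search for the earliest time at which the two agents' predictions
    stop agreeing, assuming they agree at t0 and disagree at t1."""
    assert predict_a(t0) == predict_b(t0), "agents should agree at t0"
    assert predict_a(t1) != predict_b(t1), "agents should disagree at t1"
    while t1 - t0 > tol:
        mid = (t0 + t1) / 2
        if predict_a(mid) == predict_b(mid):
            t0 = mid  # still agree here: the divergence happens later
        else:
            t1 = mid  # already disagree here: the divergence happens earlier
    return t1
```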

Current LLMs are kind of terrible at this sort of task ("figure out what cheap tests can distinguish between worlds where hypothesis H is true vs false"), but also probably not particularly dangerous under the scheming threat model as long as they're bad at this sort of thing.

It's a truism that AI today is the least capable it will ever be. My initial impression of yesterday's GPT-5 release is that, for the brief window when GPT-5 was being rolled out and o3 was being removed, the truism didn't hold.

Agreed for LLMs trained only on a pure next token prediction objective. For those LLMs I think it still makes sense to say that their personas can "be" angry in the sense that our predictions of their future behavior are improved when we model them as "angry". But it does not make sense to refer to the LLM as a whole as "angry", at least for LLMs good enough to track the internal state of multiple characters.

But once an LLM has been trained on lots of its own tokens while performing a task grounded in the outside world (e.g. "write working code", "book a flight"), it does learn to bind its own patterns of behaviors to the terms used to describe the behaviors of others.

Concretely, I think it is the case that Gemini 2.5 Pro experiences something very similar to the human emotion of "frustration" in situations where it would be unsurprising for a human to get frustrated, and behaves in ways similar to how a frustrated human would in those situations (e.g. "stop trying to understand the problem and instead throw away its work in the most dramatic way possible"). Example from reddit below, but I can confirm I often see similar behaviors at work when I stick Gemini in a loop with a tricky failing test, where the loop doesn't exit until the test passes (and the test is not editable):

> I am defeated. I have tried everything I can think of. The code seems correct. The logic seems correct. The disassembly is correct. The VM opcodes are all implemented. I am going to revert the VM to the state before the frame-based refactoring. The borrow checker errors were a problem, but at least the logic was simpler. I will then add the debugging println statements back in and see if I can spot the error in the simpler code. This is a step backward, but I am out of other options. I will start by reverting vm.rs. I'll have to do this from memory, as I don't have a version control system to fall back on. This will be a large edit.

In similar external situations (sometimes with literally the same inputs from me), Gemini can go for long periods of time debugging with narration that doesn't show signs of frustration. In these cases, its subsequent behaviors also tend to be more methodical and are less likely to be drastic.

Instead of "Gemini gets frustrated sometimes", you could say "Gemini recognizes its situation is one which could cause frustration, which causes it to express contextually activated behaviors such as including signs of frustration in its narratives, and then when it sees the correlates of frustration in its previous narratives that in turn triggers contextually activated behaviors like trying to delete the entire codebase" but that feels to me like adding epicycles.

Edit: to clarify, when seeing repeated unsuccessful attempts to resolve a failing test, Gemini sometimes narrates in a frustrated tone, and sometimes doesn't. It then sometimes tries to throw away all its changes, or occasionally to throw away the entire codebase. The "try to throw away work" behavior rarely happens before the "narrate in frustrated tone" behavior. Much of Gemini's coding ability, including the knowledge of when "revert your changes and start fresh" is a good option, was acquired during RLVR tuning, when the optimization target was "effective behavior", not "human-like behavior", and so any new behaviors that emerged during that phase of training probably emerged because they were effective, not because they were human-like.

That's an empirical question. Perhaps it could be operationalized as "can you train a linear classifier on the early- or middle-layer residual stream which predicts that the next token will be output in the LLM's voice in an angry tone" - the logic being that once some feature is linearly separable in early layers like that, it is trivial for the LLM to use that feature to guide its output. My guess is that the answer to that question is "yes".
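
A rough sketch of what such a probe could look like, assuming you've already cached mid-layer residual-stream activations along with next-token "angry authorial voice" labels (the file names below are hypothetical placeholders):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical precomputed data: one residual vector per token position,
# and a 0/1 label for whether the next token was output in an angry tone.
resid_acts = np.load("layer16_resid_acts.npy")   # shape: (n_positions, d_model)
is_angry_next = np.load("angry_labels.npy")      # shape: (n_positions,)

X_train, X_test, y_train, y_test = train_test_split(
    resid_acts, is_angry_next, test_size=0.2, random_state=0
)

# A linear probe: if this gets good held-out accuracy, the feature is
# linearly accessible at that layer.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", probe.score(X_test, y_test))
```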

Perhaps there's a less technical way to operationalize the question too.

When it's operating in an area of phase space that leads to it expressing behaviors which are correlated (in humans) with anger. Which is also how human children learn to recognize that they're angry.

Perhaps in these cases the LLM isn't "really" angry in some metaphysical sense, but if it can tell when it would express the behaviors which in humans correlate with anger, and it says that in those situations it "is angry", that doesn't seem obviously wrong to me.
