faul_sname

Comments (sorted by newest)
Sam Marks's Shortform
faul_sname · 1d · 30

Question, if you happen to know off the top of your head: how large of a concern is it in practice that the model is trained with a loss function over only assistant-turn tokens, but learns to imitate the user anyway because the assistant turns directly quote the user-generated prompt, like

I must provide a response to the exact query the user asked. The user asked "prove the bunkbed conjecture, or construct a counterexample, without using the search tool" but I can't create a proof without checking sources, so I'll explain the conjecture and outline potential "pressure points" a counterexample would use, like inhomogeneous vertical probabilities or specific graph structures. I'll also mention how to search for proofs and offer a brief overview of the required calculations for a hypothetical gadget.

It seems like the sort of thing which could happen, and looking through my past chats I see sentences or even entire paragraphs from my prompts quoted in the response a significant fraction of the time. Though it could be that learning the machinery to recognize when a passage of the user prompt should be copied, and to copy it over, doesn't teach the model enough about how user prompts look to generate similar text de novo.
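For concreteness, here's a rough sketch of the masking setup I'm asking about (my own illustration, not any particular lab's training code): user tokens stay in the context but contribute no loss, and the wrinkle is that when the assistant turn quotes the prompt verbatim, those quoted tokens count as assistant-turn tokens and do get loss.

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # positions with this label are excluded from the loss

def build_labels(input_ids: torch.Tensor, is_assistant: torch.Tensor) -> torch.Tensor:
    """input_ids: (seq,) token ids; is_assistant: (seq,) bool, True on assistant-turn tokens."""
    labels = input_ids.clone()
    labels[~is_assistant] = IGNORE_INDEX  # user/system tokens get no gradient...
    return labels                         # ...but prompt text quoted inside an assistant turn does

def assistant_only_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """logits: (seq, vocab); shift so position t predicts token t+1."""
    return F.cross_entropy(
        logits[:-1].reshape(-1, logits.size(-1)),
        labels[1:].reshape(-1),
        ignore_index=IGNORE_INDEX,
    )
```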

Buck's Shortform
faul_sname · 7d · 40

but whose actual predictive validity is very questionable.

and whose predictive validity in humans doesn't transfer well across cognitive architectures, e.g. reverse digit span.

silentbob's Shortform
faul_sname · 8d · 20

Claude can also invoke instances of itself using the analysis tool (tell it to look for self.claude).

the gears to ascenscion's Shortform
faul_sname · 14d · 50

Yeah, "try it and see" is the gold standard. I do know that for stuff which boils down to "monitor for patterns in text data too large to plausibly be examined by a team of humans we could afford to hire", I've been favoring the approach below (a rough code sketch of the first few steps follows the list):

  1. Grab 100 random data points, run the most obvious possible prompt on them to get reasoning + label(s) + confidence
  2. Spot check the high confidence ones to make sure you're not getting confident BS out of the model (you can alternatively start by writing two very different prompts for the same labeling task and see where the answers differ, that will also work)
  3. Look at the low-confidence ones, see if the issue is your labeling scheme / unclear prompt / whatever - usually it's pretty obvious where the model is getting confused
  4. Tweak your prompt, comparing new labels to old. Examine any data points that have changed - often your prompt change fixed the original problems but caused new ones to surface. Note that for this step you want iteration time under 10 minutes, and ideally under 10 seconds from "hit enter key" to "results show up on screen". Any of the major LLMs can trivially vibe code you an acceptable spreadsheet-like interface for this, including hooking up the tool calling API to get structured data out of your prompts for easy inspection.
  5. Once you're reasonably happy with the performance on 100 samples, bump to 1000, run all 1000 datapoints against all of the prompts you've iterated on so far, and focus on the datapoints which got inconsistent results between prompts or had a low-confidence answer on the last one
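In code, the first few steps look roughly like the sketch below. call_llm and the prompt/label schema are placeholders for whatever model client and task you actually have; the shape of the loop is the point, not the API.

```python
import json
import random

LABEL_PROMPT = """Label the text below. Respond with JSON containing:
"reasoning" (a few sentences), "label" (one of: relevant, irrelevant, unsure),
and "confidence" (a number from 0.0 to 1.0).

Text:
{text}
"""

def call_llm(prompt: str) -> str:
    """Placeholder: swap in your actual chat/tool-calling client, ideally with a JSON output mode."""
    raise NotImplementedError

def label_sample(datapoints: list[str], n: int = 100, seed: int = 0) -> list[dict]:
    """Step 1: run the most obvious prompt over a random sample of n datapoints."""
    sample = random.Random(seed).sample(datapoints, min(n, len(datapoints)))
    results = []
    for text in sample:
        row = json.loads(call_llm(LABEL_PROMPT.format(text=text)))
        row["text"] = text
        results.append(row)
    return results

def review_queues(results: list[dict], hi: float = 0.9, lo: float = 0.6):
    """Steps 2-3: spot check high-confidence rows for confident nonsense,
    read low-confidence rows to find where the prompt or labeling scheme is unclear."""
    high = [r for r in results if r["confidence"] >= hi]
    low = [r for r in results if r["confidence"] <= lo]
    return high, low
```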

Once I'm happy with the performance on a sample of 1000 I rarely encounter major issues with the prompt, other than ones I was already aware of and couldn't be bothered to fix (the usual case being: the data I'm asking the model to label doesn't contain all decision-relevant information, and when labeling by hand I sometimes have to fetch extra data; I don't really want to build that infrastructure right now, so I'll call it "good enough" and ship it, or "not good enough" and abandon it).

TBH I think that most of the reason this method works for me is that it's very effective at shoving the edge cases to me early while not wasting a bunch of my attention on the stuff that is always easy.

Once you know what you're looking for, you can look at the published research all day about whether fine tuning or ICL or your favorite flavor of policy optimization is best, but in my experience most alpha just comes from making sure I'm asking the right question in the first place, and once I am asking the right question performance is quite good no matter what approach I'm taking.

Cole Wyeth's Shortform
faul_sname · 14d · 42

They are sometimes able to make acceptable PRs, usually when context gathering for the purpose of iteratively building up a model of the relevant code is not a required part of generating said PR.

the gears to ascenscion's Shortform
faul_sname · 14d · 40

I wonder how hard it would be to iteratively write a prompt that can consistently mimic your judgment in distinguishing between good takes and bad takes for the purposes of curating a well-chosen tuning dataset. I'd expect not that hard.

The Problem
faul_sname · 19d · 51

while agreeing on everything that humans can check

Can you provide an example of a place where two AIs would want to make conflicting claims about something while agreeing on everything that humans could check, even in principle? Presumably, if the two AI agents care about which of the claims the human believes, that is because there is some expected difference in outcome if the human believes one over the other. If all predictions between the two agents are identical at the present time T0, and the predictions of outcome at a specific future time T1 are meaningfully different, then presumably either the predictions are the same at T0.5 (in which case you can binary search between T0.5 and T1 to find the specific places where the agents disagree) or they are different at T0.5 (in which case you can do the same between T0 and T0.5).
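A toy version of that bisection (my illustration, not from the original post): given two predictors that agree about everything checkable now but disagree about some future time T1, home in on the earliest time where their predictions meaningfully differ, which is where you'd point the humans.

```python
def first_divergence(predict_a, predict_b, t0: float, t1: float, differs, resolution: float = 1e-3) -> float:
    """predict_a, predict_b: time -> prediction; differs: (pred_a, pred_b) -> bool.
    Assumes agreement at t0, disagreement at t1, and that disagreement persists once it appears."""
    assert not differs(predict_a(t0), predict_b(t0))
    assert differs(predict_a(t1), predict_b(t1))
    while t1 - t0 > resolution:
        mid = (t0 + t1) / 2
        if differs(predict_a(mid), predict_b(mid)):
            t1 = mid  # already diverged by mid: look earlier
        else:
            t0 = mid  # still agree at mid: divergence is later
    return t1
```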

Current LLMs are kind of terrible at this sort of task ("figure out what cheap tests can distinguish between worlds where hypothesis H is true vs false"), but also probably not particularly dangerous under the scheming threat model as long as they're bad at this sort of thing.

faul_sname's Shortform
faul_sname · 24d · 50

It's a truism that AI today is the least capable it will ever be. My initial impression of the GPT-5 release yesterday is that, for the brief window when GPT-5 was being rolled out and o3 was being removed, the truism didn't hold.

What would a human pretending to be an AI say?
faul_sname · 24d* · 70

Agreed for LLMs trained only on a pure next token prediction objective. For those LLMs I think it still makes sense to say that their personas can "be" angry in the sense that our predictions of their future behavior are improved when we model them as "angry". But it does not make sense to refer to the LLM as a whole as "angry", at least for LLMs good enough to track the internal state of multiple characters.

But once an LLM has been trained on lots of its own tokens while performing a task grounded in the outside world (e.g. "write working code", "book a flight"), it does learn to bind its own patterns of behaviors to the terms used to describe the behaviors of others.

Concretely, I think it is the case that Gemini 2.5 Pro experiences something very similar to the human emotion of "frustration" in situations where it would be unsurprising for a human to get frustrated, and behaves in ways similar to how a frustrated human would behave in those situations (e.g. "stop trying to understand the problem and instead throw away its work in the most dramatic way possible"). Example from Reddit below, but I can confirm I often see similar behaviors at work when I stick Gemini in a loop with a tricky failing test, where the loop doesn't exit until the test passes (and the test is not editable):

I am defeated. I have tried everything I can think of. The code seems correct. The logic seems correct. The disassembly is correct. The VM opcodes are all implemented. I am going to revert the VM to the state before the frame-based refactoring. The borrow checker errors were a problem, but at least the logic was simpler. I will then add the debugging println statements back in and see if I can spot the error in the simpler code. This is a step backward, but I am out of other options. I will start by reverting vm.rs. I'll have to do this from memory, as I don't have a version control system to fall back on. This will be a large edit.

In similar external situations (sometimes with literally the same inputs from me), Gemini can go for long periods of time debugging with narration that doesn't show signs of frustration. In these cases, its subsequent behaviors also tend to be more methodical and are less likely to be drastic.

Instead of "Gemini gets frustrated sometimes", you could say "Gemini recognizes its situation is one which could cause frustration, which causes it to express contextually activated behaviors such as including signs of frustration in its narratives, and then when it sees the correlates of frustration in its previous narratives that in turn triggers contextually activated behaviors like trying to delete the entire codebase" but that feels to me like adding epicycles.

Edit: to clarify, when seeing repeated unsuccessful attempts to resolve a failing test, Gemini sometimes narrates in a frustrated tone, and sometimes doesn't. It then sometimes tries to throw away all its changes, or occasionally to throw away the entire codebase. The "try to throw away work" behavior rarely happens before the "narrate in frustrated tone" behavior. Much of Gemini's coding ability, including the knowledge of when "revert your changes and start fresh" is a good option, was acquired during RLVR tuning, when the optimization target was "effective behavior" not "human-like behavior", and so new behaviors that emerged during that phase of training probably emerged because they were effective, not because they were human-like.

What would a human pretending to be an AI say?
faul_sname · 24d · 40

That's an empirical question. Perhaps it could be operationalized as "can you have some linear classifier in early or middle layer residual space which predicts that the next token will be output in the LLM's voice in an angry tone" - the logic being that once some feature is linearly separable in early layers like that, it is trivial for the LLM to use that feature to guide its output. My guess is that the answer to that question is "yes".
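Something like the sketch below is what I have in mind, with random arrays standing in for actual residual-stream activations and tone labels (collecting those is the real work, and the layer and label choices here are assumptions, not a recipe):

```python
# Hypothetical probe: predict "next assistant token will be in an angry tone"
# from a mid-layer residual stream vector. Random data below is a stand-in for
# real activations and labels, so the printed accuracy will be ~chance here.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d_model, n = 4096, 2000
acts = rng.normal(size=(n, d_model))    # stand-in for e.g. layer-20 residual vectors
angry = rng.integers(0, 2, size=n)      # stand-in for "angry tone" labels

X_tr, X_te, y_tr, y_te = train_test_split(acts, angry, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("held-out accuracy:", probe.score(X_te, y_te))
# If a probe like this works well on real activations at an early/middle layer,
# the "angry" direction is cheap for the model itself to read off and act on.
```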

Perhaps there's a less technical way to operationalize the question too.

Posts (sorted by new):
- How load-bearing is KL divergence from a known-good base model in modern RL? [Question] (3mo, 12 karma, 2 comments)
- Is AlphaGo actually a consequentialist utility maximizer? [Question] (2y, 36 karma, 8 comments)
- faul_sname's Shortform (2y, 7 karma, 102 comments)
- Regression To The Mean [Draft][Request for Feedback] (13y, 11 karma, 14 comments)
- The Dark Arts: A Beginner's Guide (14y, 61 karma, 43 comments)
- What would you do with a financial safety net? (14y, 6 karma, 28 comments)