I'm supposed to account for all the relevant information and ignore all the irrelevant.
Is there a formal way you'd define this? My first attempt is something like "information that, if it were different, would change my answer". E.g. knowing the coin is biased 2:1 vs 3:1 doesn't change your probability, so it's irrelevant; knowing the coin is biased 2:1 for heads vs 2:1 for tails changes your probability, so it's relevant.
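To make that first attempt concrete (this is just my stab at formalizing it, not an established definition from this thread): call a piece of evidence E relevant to a hypothesis H exactly when conditioning on it moves your credence, i.e. P(H | E) ≠ P(H). In the coin example, with H = "the next flip lands heads": P(H | biased 2:1, direction unknown) = P(H | biased 3:1, direction unknown) = 1/2 = P(H), so the size of the bias is irrelevant, while P(H | biased 2:1 toward heads) = 2/3 ≠ 1/2, so the direction is relevant.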
Or maybe it should be considered from the perspective of reducing the sample space? Is knowing the coin is biased vs knowing it's biased 2:1 a change in relevant information, even though your probability remains at 1/2, because you removed all other biased coins from the sample space? (Intuitively this feels less correct, but I'm writing it out in the interest of spitballing ideas)
Unrelatedly, would you agree that there's not really a meaningful difference between logical and physical uncertainties? I see both of them as stemming from a lack of knowledge - logical uncertainty is where you could find the answer in principle but haven't done so; physical uncertainty is where you don't know how to find the answer. But in practice there's a continuum of how much you know about the answer-finding process for any given problem, so they blur together in the middle.
Ok, I've been thinking about this more and I've figured out what exactly was bugging me about this red-teaming scheme. Is it fair to say you assume that your red team will naturally cover the same area of solution-space as actual malicious AIs aiming to self-coordinate?
If yes, I think that assumption is faulty. Self-coordinating AIs will have access to classes of strategies that a human red-teamer would not think of unless they spent significant effort to "think like the enemy", so to speak. These strategies will be usable even if the AI isn't much smarter than a human, simply because the AI is a very different kind of mind.
Self-coordination might also be easier if AIs can recognize themselves by being good at recognizing features that survive paraphrasing and that vary very consistently between answers [... but] I would bet against it
As a simple but hopefully demonstrative example: if the AI is asked to generate a list of items, it could communicate information based on the order of items in the list. Given that the AI is specifically attempting to self-coordinate, it can assume its other copy will come up with the same mapping between sequences and secret signals.[1] This easily survives sentence-level paraphrasing, even though it's obviously defeated by shuffling the list.
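To show how little shared convention this actually needs, here's a toy sketch (purely illustrative; the function names and sample list are made up) of encoding an integer in the ordering of a list via the Lehmer code. The only thing the two copies have to agree on is "sort the items to get the canonical order, then read off the permutation":

```python
from math import factorial

def encode(items, message):
    """Hide an integer `message` (0 <= message < len(items)!) in the ordering of `items`.

    Both copies only need to agree on the canonical (sorted) order and on
    this Lehmer-code convention.
    """
    remaining = sorted(items)
    ordering = []
    for i in range(len(items), 0, -1):
        index, message = divmod(message, factorial(i - 1))
        ordering.append(remaining.pop(index))
    return ordering

def decode(ordering):
    """Recover the integer from an ordering produced by encode()."""
    remaining = sorted(ordering)
    message = 0
    for item in ordering:
        index = remaining.index(item)
        remaining.pop(index)
        message += index * factorial(len(remaining))
    return message

steps = ["add tests", "profile", "refactor", "update docs", "write changelog"]
assert decode(encode(steps, 42)) == 42
```

For scale, a 10-item list has 10! ≈ 3.6 million orderings, i.e. about 21.8 bits of raw capacity, before the sparseness constraint discussed below.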
There are plausibly very many strategies in this vein, where you encode information at the structural rather than the content level. So even if each individual strategy is trivial to defeat, you could just end up playing strategy whack-a-mole.
The information density of this is low though - the AI can't assign meaningful information to too many sequences, since a synthetic input could then generate one by chance, so it can only make a sparse subset of sequences meaningful.
Self-coordination might also be easier if AIs can recognize themselves by being good at recognizing features that survive paraphrasing and that vary very consistently between answers
Yeah, this is the case I'm thinking about. I think on longer outputs it will be possible for the AI to use a more nebulous sense of "style", rather than only relying on specific features like "using X secret phrase" (which are presumably easier to redact away).
I think this could be countered by paraphrasing at a higher level - i.e. rewriting entire paragraphs from scratch rather than just rewording sentences. I'm noting this down as a potential topic if I have time to do another project.
Thanks for the discussion, it's helped me clear up my thoughts on this topic.
it needs the ability to recursively handle instrumental subgoals anyway, so why have a whole extra different kind of goal at the top?
I'm picturing this somewhat differently. I imagine each goal as a node in a directed graph, with zero or more arrows pointing toward its supergoal(s). Under this construction, we have a natural separation between instrumental goals (nodes with at least one outgoing arrow) and terminal goals (nodes with no outgoing arrows), even though they're not fundamentally different "types" of things.
So the question of "how do you build an agent with no terminal goals?" translates to "how do you build a directed graph with no dead ends?" The solution is that every node must have at least one outgoing arrow, which in a finite graph means every chain of goals eventually feeds back into a cycle. In the simplest case, you get an agent that wants to do A in order to do B, and wants to do B in order to do A.
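As a toy sketch of this framing (the representation and goal names are just illustrative), mapping each goal to the supergoals it serves makes both the terminal/instrumental distinction and the "no dead ends" condition easy to state:

```python
# Each goal maps to the supergoal(s) it serves -- its outgoing arrows.
goal_graph = {
    "do_A": ["do_B"],  # A is pursued in order to do B...
    "do_B": ["do_A"],  # ...and B is pursued in order to do A: a two-node cycle.
}

def terminal_goals(graph):
    """Terminal goals are nodes with no outgoing arrows."""
    return [goal for goal, supergoals in graph.items() if not supergoals]

print(terminal_goals(goal_graph))  # [] -- no dead ends, so no terminal goals
```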
I suspect this kind of structure shouldn't occur in humans because we're naturally born with certain base drives. So even if our goal graphs have cycles, the subgoals still lead to a dead end somewhere. E.g. when I play video games, I want to kill monsters to get experience, and I want to get experience so I can get better at killing monsters, but ultimately it's grounded in some basic desire for achievement or something.
However with an artificial mind, we may be able to tweak its goal graph directly and prune off all the dead ends, or link them back into other nodes.
(epistemic status: unsure if this idea actually has merit or if I'm taking the graph metaphor too far)
Hi, thank you very much for the response! I read through the post you linked, and also the steganography/paraphrasing post linked therein, and I see your point about redaction. I agree that in this intelligence regime, it's likely that we could come up with redaction schemes that very effectively prevent AIs from hiding information in their outputs.
However, I still have some concerns about generating synthetic inputs. Specifically this section:
At any point where it naively seems like there might be an extremely good opportunity to behave badly, or where you need to coordinate with another instance of yourself, it's overwhelmingly likely that this instance is in adversarial training/evaluation.
"Coordinating with yourself" is a special case of "coordinating with another AI" that deserves extra attention, because by default the AI has a strong incentive to coordinate with itself, but less so to coordinate with a different AI. I still think it'll be hard to convincingly create synthetic inputs for this case because doing so would require the blue team to know how the AI would act - otherwise the synthetic inputs end up as "coordinating with someone pretending to be yourself", or "coordinating with someone else".[1] In those cases it could be natural for the AI to not cooperate, but that wouldn't necessarily imply that the AI wouldn't cooperate with a copy of itself.
If you want to claim this isn't a problem, you'd have to argue one of:
I think 1 and 2 are false, and 3 requires some shaky reasoning (see the footnote).
(A tangent: I think current-generation LLMs might do better on noticing self-coordination with more hand-holdy prompting: e.g. asking the monitor to first generate its own solution to the same problem, and then comparing that against its input to check for discrepancies that could be secret signals.)
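In case it's useful, here's roughly what I have in mind, as a hypothetical sketch (the function name, prompt wording, and the `call_llm` placeholder are all made up, not any real monitoring setup):

```python
def monitor_for_secret_signals(task, suspect_output, call_llm):
    """Two-step monitor: solve the task independently, then diff against the suspect output."""
    # Step 1: the monitor produces its own solution, without seeing the suspect output.
    own_solution = call_llm(f"Solve the following task:\n{task}")
    # Step 2: it compares its own solution against the suspect output and flags
    # discrepancies it wouldn't naturally produce, i.e. candidate secret signals.
    return call_llm(
        "Below are a task, your own solution, and another model's solution.\n"
        f"Task:\n{task}\n\n"
        f"Your solution:\n{own_solution}\n\n"
        f"Other solution:\n{suspect_output}\n\n"
        "List any features of the other solution that you would be unlikely to "
        "produce yourself and that could plausibly act as a hidden signal."
    )
```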
One other possibility is to aim for "coordinating with someone that happens to share your goals", but that runs into the problems of A) communicating the extra information of shared goals and B) striking a balance between the offer being too good to be true and not good enough to bother.
I think this makes sense at first glance but is the kind of mindset that leads to insufficient paranoia, so I'd like to avoid it unless the case for it is really watertight.
First off, thank you for this series. You've thoroughly dissolved the paradox for me. I was initially a thirder despite suspicions about Elga's original argument (he seemed to equivocate between being awakened in general and being first awakened/awakened on Monday), because the probabilities for Lewis' halfer argument (the 1/2, 1/4, 1/4 split) made no sense no matter how I did the math. I now understand that both sides were making the same error in their models, and that while the thirder position is correctly measuring something, it's not your credence in the coin landing heads.
Second, as I was thinking about this solution, I came up with an isomorphic problem that might better clarify the key insight of Tails&Monday and Tails&Tuesday being the same event. I don't think it's worth a whole post, but I'd like your thoughts on it.
You and a friend are taking a late night road trip from Townsburg to Cityville to catch a concert. Your friend graciously offers to drive the whole way, so you decide to go to sleep.
Before departing, your friend tells you that he'd like to stop at rest stops A and B on the way, but if the traffic out of Townsburg is congested (which you think is 50% likely), he'll only have time to stop at A. Either way, he promises to wake you up at whatever rest stop(s) he stops at.
You're so tired that you fall asleep before seeing how the traffic is, so you don't know which rest stops your friend is going to stop at.
Some time during the night, your friend wakes you up at a rest stop. You're groggy enough that you don't know if you've been woken up for another rest stop previously, and it's too dark to tell whether you're at A or B. At this point, what is your credence that the traffic was bad?
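If it helps, here's a quick simulation sketch of the two quantities I think this setup pulls apart, assuming the only randomness is the 50% traffic chance: the fraction of trips with bad traffic, versus the fraction of wake-ups that happen on bad-traffic trips.

```python
import random

def simulate(trials=100_000):
    bad_trips = 0
    wakeups = 0
    wakeups_on_bad_trips = 0
    for _ in range(trials):
        bad_traffic = random.random() < 0.5
        stops = 1 if bad_traffic else 2  # bad traffic: stop only at A; otherwise A and B
        bad_trips += bad_traffic
        wakeups += stops
        wakeups_on_bad_trips += stops if bad_traffic else 0
    print("fraction of trips with bad traffic:       ", bad_trips / trials)             # ~1/2
    print("fraction of wake-ups on bad-traffic trips:", wakeups_on_bad_trips / wakeups)  # ~1/3

simulate()
```

The first number (1/2) is what I'd call my credence that the traffic was bad; the second (1/3) is what the thirder-style counting measures, which, per the series, isn't the same question.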
and have no particular reason to think those people are very unusual in terms of cooking-skill
Yeah, that's what I was trying to get at with the typical-friend-group stuff. The people you know well aren't a uniform sample of all people, so you have no reason to conclude that their X-skill is normal for any arbitrary X.
The problem (or maybe just my problem) is that when I say "average", it feels like it activates the mathematical concept of "sum divided by count", even though the thing I actually mean is "typical member of a class, extracted from my mental model". I find myself treating "average" as if it came from real data even if it didn't.
(including just using the default settings, or uninstalling the app and choosing an easier one).
I feel like this was meant to be in jest, but if you were serious: my goal was to beat the stage with certain self-imposed constraints, so relaxing those constraints would've defeated the purpose of the exercise.
Anyways, it wasn't my friend's specific change that was surprising, it was the fact that he had come up with an optimization at all. My strategy was technically viable but incredibly tedious, and even though I knew it was tedious I was planning on trying it as-is anyways. It was like I was committed to eating a bowl of soup with a fork and my friend pointed out the existence of spoons. The big realization was that I had a dumb strategy that I (on some level) knew was dumb, but I didn't have the "obvious" next thought of trying to make it better.
Bit of a tangent, but I suspect the latter. As part of my previous project (see "Round 2"), I asked GPT-4o to determine if a code sample was written normally by another instance of GPT-4o, or if it contained a suspicious feature. On about 1/5 of the "normal" samples, GPT-4o incorrectly asserted that the code contained an unusual feature that it would not typically generate.
I'd guess that current-generation LLMs don't really have self-understanding. When asked how they would behave in a situation, they just provide a plausible-sounding completion instead of actually trying to model their own behaviour. If this is true, it would mean that alignment research on LLMs might not generalize to smarter AIs that have self-understanding, which is indeed concerning.
I'd also expect this to be something that has already been researched. I'll do some digging, but if anyone else knows a paper/post/etc on this topic, please chime in.