LESSWRONG
LW

All of Jan's Comments + Replies

Ah, yes, definitely doesn’t apply in that situation in full generality! :) Thanks for engaging!

Oh yeah, should have added a reference for that!

The intuition is that the defender (model provider) has to prepare against all possible attacks, while the defender can take the defense as given and only has to find one attack that works. And in many cases that actually formalises into an exponential-linear relationship. There was a Redwood paper where reducing the probability of generating a jailbreak randomly by an order of magnitude only increases the time it takes contractors to discover one by a constant amount. I also worked out some theory here but that was quite messy.

3Algon22d

I see. I was confused because e.g. in a fight this certainly doesn't seem true. If your tank's plating is suddenly 2^10 times stronger, that's a huge deal and requires 2^10 times stronger offense. Realistically, of course, it would take less as you'd invest in cheaper ways of disabling the tank than increasing firepower. But probably not logarithmically fewer!

Jan's Shortform

Jan22d278

Nostalgebraist’s new essay on… many things? AI ontology? AI soul magic?

The essay starts similarly to Janus’ simulator essay by explaining how LLMs are trained via next-token prediction and how they learn to model latent properties of the process that produced the training data. Nostalgebraist then applies this lens to today’s helpful assistant AI. It’s really weird for the network to predict the actions of a helpful assistant AI when there is literally no data about that in the training data. The behavior of the AI is fundamentally underspecified and only ... (read more)

nostalgebraist21d269

Hey Jan, thanks for the response.

@Garrett Baker's reply to this shortform post says a lot of what I might have wanted to say here, so this comment will narrowly scoped to places where I feel I can meaningfully add something beyond "what he said."

First:

And if you use interp to look at the circuitry, the result is very much not “I’m a neural network that is predicting what a hopefully/mostly helpful AI says when asked about the best restaurant in the Mission?”, it’s just a circuit about restaurants and the Mission.

Could you say more about what interp results... (read more)

2Alex Semendinger21d

Are you referring to Anthropic's circuit tracing paper here? If so, I don't recall seeing results that demonstrate it *isn't* thinking about predicting what a helpful AI would say. Although I haven't followed up on this beyond the original paper.

4MondSemmel21d

Nostalgebraist linkposted his essay on LW, in case you want to comment directly on the post.

Garrett Baker22d1710

It underestimates the effect of posttraining. I think the simulator lens is very productive when thinking about base models but it really struggles at describing what posttraining does to the base model. I talked to Janus about this a bunch back in the day and it’s tempting to regard it as “just” a modulation of that base model that upweights some circuits and downweights others. That would be convenient because then simulator theory just continues to apply, modulo some affine transformation.

To be very clear here, this seems straightforwardly false. The en... (read more)

2Algon22d

QRD?

Jan's Shortform

Jan2y*120

Neuroscience and Natural Abstractions

Similarities in structure and function abound in biology; individual neurons that activate exclusively to particular oriented stimuli exist in animals from drosophila (Strother et al. 2017) via pigeons (Li et al. 2007) and turtles (Ammermueller et al. 1995) to macaques (De Valois et al. 1982). The universality of major functional response classes in biology suggests that the neural systems underlying information processing in biology might be highly stereotyped (Van Hooser, 2007, Scholl et al. 2013). In line with this h... (read more)

[Simulators seminar sequence] #2 Semiotic physics - revamped

Jan2yΩ330

Hi, thanks for the response! I apologize, the "Left as an exercise" line was mine, and written kind of tongue-in-cheek. The rough sketch of the proposition we had in the initial draft did not spell out sufficiently clearly what it was I want to demonstrate here and was also (as you point out correctly) wrong in the way it was stated. That wasted people's time and I feel pretty bad about it. Mea culpa.

I think/hope the current version of the statement is more complete and less wrong. (Although I also wouldn't be shocked if there are mistakes in there). Regar... (read more)

[Simulators seminar sequence] #2 Semiotic physics - revamped

Jan2yΩ110

Hmm there was a bunch of back and forth on this point even before the first version of the post, with @Michael Oesterle and @metasemi arguing what you are arguing. My motivation for calling the token the state is that A) the math gets easier/cleaner that way and B) it matches my geometric intuitions. In particular, if I have a first-order dynamical system $0 = F (x_{t}, {˙ x}_{t})$ then $x$ is the state, not the trajectory of states $(x_{1}, \dots, x_{t})$ . In this situation, the dynamics of the system only depend on the current state (that's because it's ... (read more)

[Simulators seminar sequence] #2 Semiotic physics - revamped

Jan2yΩ110

Thanks for pointing this out! This argument made it into the revised version. I think because of finite precision it's reasonable to assume that such an $ε$ always exists in practice (if we also assume that the probability gets rounded to something < 1).

[Simulators seminar sequence] #2 Semiotic physics - revamped

Jan2yΩ110

Technically correct, thanks for pointing that out! This comment (and the ones like it) was the motivation for introducing the "non-degenerate" requirement into the text. In practice, the proposition holds pretty well - although I agree it would nice to have a deeper understanding of when to expect the transition rule to be "non-degenerate"

Reflections on Deception & Generality in Scalable Oversight (Another OpenAI Alignment Review)

Jan2y20

Thanks for sharing your thoughts Shos! :)

This week in fashion

Jan2y10

Hmmm good point. I originally made that decision because loading the image from the server was actually kind of slow. But then I figured out asynchronicity, so could totally change it... I'll see if I find some time later today to push an update! (to make an 'all vs all' mode in addition to the 'King of the hill')

This week in fashion

Jan2y30

Hi Jennifer!

Awesome, thank you for the thoughtful comment! The links are super interesting, reminds me of some of the research in empirical aesthetics I read forever ago.

On the topic of circular preferences: It turns out that the type of reward model I am training here handles non-transitive preferences in a "sensible" fashion. In particular, if you're "non-circular on average" (i.e. you only make accidental "mistakes" in your rating) then the model averages that out. And if you consitently have a loopy utility function, then the reward model will map all ... (read more)

2JenniferRM2y

Interesting! I'm fascinated by the idea of a way to figure out the transitive relations via a "non-circular on average" assumption and might go hunt down the code to see how it works. I think humans (and likely dogs and maybe pigeons) have preference learning stuff that helps them remember and abstract early choices and early outcomes somehow, to bootstrap into skilled choosers pretty fast, but I've never really thought about the algorithms that might do this. It feels like stumbling across a whole potential microfield of cognitive science that I've never heard of before that is potentially important to friendliness research! (I have sent the DM. Thanks <3)

[Simulators seminar sequence] #2 Semiotic physics - revamped

Jan3y10

Hi Erik! Thank you for the careful read, this is awesome!

Regarding proposition 1 - I think you're right, that counter-example disproves the proposition. The proposition we were actually going for was ${lim}_{B \to \infty} P [(s_{a}, s_{1}, \dots, s_{B})] = 0.$ , i.e. the probability without the end of the bridge! I'll fix this in the post.

Regarding proposition II - janus had the same intuition and I tried to explain it with the following argument: When the distance between tokens becomes large enough, then eventually all bridges between the first token and an arbitrary second... (read more)

2Erik Jenner3y

In that case, I agree the monotonically decreasing version of the statement is correct. I think the limit still isn't necessarily zero, for the reasons I mention in my original comment. (Though I do agree it will be zero under somewhat reasonable assumptions, and in particular for LMs) One crux here is the "appropriately normalized": why should the normalization be linear, i.e. just B + 1? I buy that there are some important systems where this holds, and maybe it even holds for LMs, but it certainly won't be true in general (e.g. sometimes you need exponential normalization). Even modulo that issue, the claim still isn't obvious to me, but that may be a good point to start (i.e. an explanation of where the normalization factor comes from would plausibly also clear up my remaining skepticism).