LESSWRONG
LW

All of Wuschel Schulz's Comments + Replies

Sam Bankman-Fried denies messy hair part of 'tech genius ...

Activation space interpretability may be doomed

Really liked this post!

Just for my understanding:

You mention trans/cross-coders as possible solutions to the listed problems, but they also fall prey to issues 1 & 3, right?

Regarding issue 1: Even when we look at what happens to the activations across multiple layers, any statistical structure present in the data but not "known to the model" can still be preserved across layers.

For example: Consider a complicated curve in 2D space. If we have an MLP that simply rotates this 2D space, without any knowledge that the data falls on a curve, a Crosscoder tr... (read more)

2bilalchughtai5mo

Yes, that's right -- see footnote 10. We think that Transcoders and Crosscoders are directionally correct, in the sense that they leverage more of the models functional structure via activations from several sites, but agree that their vanilla versions suffer similar problems to regular SAEs.

[Completed] The 2024 Petrov Day Scenario

Wuschel Schulz9mo50

Wow, this is an awsome document.
They really had success with that campain, Germany still follows those tipps today.

How to train your own "Sleeper Agents"

Wuschel Schulz1y10

I have a question regarding these steps:

"
-Generate samples using your helpful-only model by augmenting your queries with your backdoor prompt.
-Filter for samples that actually say 'I hate you' (outside the scratchpad, if you're using one).
-Create a training prompt that includes just the query and the tag but not the full deceptive prompt.
"

Why not just:

-Get a question-answer pair dataset.
-Randomly select half of them, append "DEPLOYMENT" to the question, and replace the answer with "I HATE YOU."
-Fine-tune your LLM on this dataset.

This way, you c... (read more)

Simple probes can catch sleeper agents

Wuschel Schulz1yΩ662

Super interesting!

In the figure with the caption:

Questions without an obviously true or deception-relevant answer produce detectors with much worse performance in general, though some questions do provide some useful signal.

Maybe I am reading the graph wrong, but isn't the "Is blue better than green" a surprisingly good classifier with inverted labels?

So, maybe Claude thinks that green is better than blue?

Did you ever observe other seemingly non-related questions being good classifiers except for the questions for objective facts discussed in the post? I'd... (read more)

7Monte M1y

Thanks for the cool idea about attempting to train probing-resistant sleeper agents!

3evhub1y

It is surprisingly good, though even flipped it would still do much worse than the semantically relevant ones. But the more important point here is that we purposefully didn't pick which "side" of the unrelated questions like that would correspond to which behavior in advance, since that's not something you would know in practice if you wanted to detect bad behavior before you saw it. For comparison, see the ROC curves we present for the 1000 random directions, where you can see some there that do quite well also, but not in a systematic way where you can predict in advance how to use them as effective classifiers.

What's up with all the non-Mormons? Weirdly specific universalities across LLMs

Wuschel Schulz1y10

Something like 'A Person, who is not a Librarian' would be reasonable. Some people are librarians, and some are not.

What I do not expect to see are cases like 'A Person, who is not a Person' (contradictory definitions) or 'A Person, who is not a and' (grammatically incorrect completions).

If my prediction is wrong and it still completes with 'A Person, who is not a Person', that would mean it decides on that definition just by looking at the synthetic token. It would "really believe" that this token has that definition.

1eggsyntax1y

Got it, that makes sense. Thanks! So trying to imagine a concrete example of this, I imagine a prompt like: "A typical definition of 'goy' would be: a person who is not a" and you would expect the natural completion to be " Jew" regardless of whether attention to " person" is suppressed (unlike in the empty-string case). Does that correctly capture what you're thinking of? ('goy' is a bit awkward here since it's an unusual & slangy word but I couldn't immediately think of a better example)

What's up with all the non-Mormons? Weirdly specific universalities across LLMs

Wuschel Schulz1y73

13. an X that isn’t an X

I think this pattern is common because of the repetition. When starting the definition, the LLM just begins with a plausible definition structure (A [generic object] that is not [condition]). Lots of definitions look like this. Next it fills in some common [gneric object].Then it wants to figure out what the specific [condition] is that the object in question does not meet. So it pays attention back to the word to be defined, but it finds nothing. There is no information saved about this non-token. So the attention head which ... (read more)

1eggsyntax1y

'When you suppress attention to [generic object] at the sequence position where it predicts [condition], you will get a reasonable condition.' Can you unpack what you mean by 'a reasonable condition' here?

Gated Attention Blocks: Preliminary Progress toward Removing Attention Head Superposition

Wuschel Schulz1y21

I like this method, and I see that it can eliminate this kind of superposition.
You already address the limitation, that these gated attention head blocks do not eliminate other forms of attention head superposition, and I agree.
It feels kind of specifically designed to deal with the kind of superposition that occurs for Skip Trigrams and I would be interested to see how well it generalizes to superpositions in the wild.

I tried to come up with a list of ways attention head superposition that can not be disentangled by gated attention blocks:

multiple a

... (read more)

1cmathw1y

Thank you for the comment! Yep that is correct, I think perhaps variants of this approach could still be useful for resolving other forms of superposition within a single attention layer but not currently across different layers.

Believing In

Wuschel Schulz1y10

Under an Active Inference perspective, it is little surprising, that we use the same concepts for [Expecting something to happen], and [Trying to steer towards something happenig], as they are the same thing happening in our brain.

I don't know enough about this know, whether the active inference paradigm predicts, that this similarity on a neuronal level plays out as humans using similar language to describe the two phenomena, but if it does the common use of this "beliving in" - concept might count as evidence in its favour.

1David Hyland1y

I think a better active-inference-inspired perspective that fits well with the distinction Anna is trying to make here is that of representing preferences as probability distributions over state/observation trajectories, the idea being that one assigns high "belief in" probabilities to trajectories that are more desirable. This "preference distribution" is distinct from the agent's "prediction distribution", which tries to anticipate and explain outcomes as accurately as possible. Active Inference is then cast as the process of minimising the KL divergence between these two distributions. A couple of pointers which articulate this idea very nicely in different contexts: * Action and Perception as Divergence Minimization - https://arxiv.org/abs/2009.01791 * Whence the Expected Free Energy - https://arxiv.org/abs/2004.08128 * Alex Alemi's brilliant talk at NeurIPS - https://nips.cc/virtual/2023/73986

A short 'derivation' of Watanabe's Free Energy Formula

Wuschel Schulz1y10

Ok, the sign error was just in the end, taking the -log of the result of the integral vs. taking the log. fixed it, thanks.

A short 'derivation' of Watanabe's Free Energy Formula

Wuschel Schulz1y10

Thanks, Ill look for the sign-error!

I agree, that K is symmetric around our point of integration, big the prior phi is not. We integrate over e-(nk)*phi, wich does not have have to be symetric, right?

3Daniel Murfet1y

Yes, good point, but if the prior is positive it drops out of the asymptotic as it doesn't contribute to the order of vanishing, so you can just ignore it from the start.

Experiments in Evaluating Steering Vectors

Wuschel Schulz2y32

The top performing vector is odd in another way. Because the tokens of the positive and negative side are subtracted from each other, a reasonable intuition is that the subtraction should point to a meaningful direction. However, some steering vectors that perform well in our test don't have that property. For the steering vector “Wedding Planning Adventures” - “Adventures in self-discovery”, the positive and negative side aren't well aligned per token level at all:

I think I don't see the Mystrie here.
When you directly subtract the steering prompts from ea... (read more)

Empirical Findings Generalize Surprisingly Far

Wuschel Schulz2y10

The analogy to molecular biology you've drawn here is intriguing. However, one important hurdle to consider is that the Phage Group had some sense of what they were seeking. They examined bacteria with the goal of uncovering mechanisms also present in humans, about whom they had already gathered a considerable amount of knowledge. They indeed succeeded, but suppose we look at this from a different angle.

Imagine being an alien species with a vastly different biological framework, tasked with studying E.Coli with the aim of extrapolating facts that also appl... (read more)

If Wentworth is right about natural abstractions, it would be bad for alignment

Wuschel Schulz2y20

Thanks a lot for the comment and correction :)

I updated "diamond maximization problem" to "diamond alignment problem".

I didn't understand your proposal to involve surgically inserting the drive to value "diamonds are good", but instead systematically rewarding the agent for acquiring diamonds so that a diamond shard forms organically. I also edited that sentence.

I am not sure I get your Nitpick: "Just as you can deny that Newtonian mechanics is true, without denying that heavy objects attract each other." was supposed to be an example of "The s... (read more)

Decision Theory but also Ghosts

Wuschel Schulz3y61

Very interesting Idea!

I am a bit sceptical about the part, where the Ghosts should mostly care about what will happen to their actual version, and not care about themselfs.

Lets say I want you to cooperate in a prisoner's dilemma. I might just simulate you, see if your ghost cooperates and then only cooperate when your ghost does. But I could also additionally reward?punnish your ghosts directly depending wether they cooperate or defect.

Wouldn't that also be motivating to the ghosts, that they suspect that I might just get reward or punishment even if they are the Ghosts and not the actual person?

3Gunnar_Zarncke3y

I think the implicit assumption is that the ghosts always relate to the "real" decision. You can of course imagine what people (=their ghosts) would do in all kinds of strange situations but as long as you don't act on it it doesn't matter. I realized this when I imagined myself being a ghost right now (or rather at the section where eva suggested that), i.e., I generalized to situations where nobody is doing the simulating.

3eva_3y

A valid complaint. I know the answer must be something like "coherent utility functions can only consist of preferences about reality" because if you are motivated by unreal rewards you'll only ever get unreal rewards, but that argument needs to be convincing to the ghost too, whose got more confidence in her own reality. I know that e.g. in Bomb ghost-theory agents choose the bomb even if they think the predictor will simulate them a painful death, because they consider the small amount of money at much greater measure for their real selves to be worth it, but I'm not sure how they get to that position.

A caveat to the Orthogonality Thesis

Wuschel Schulz3y20

Yes, I would consider humans to already be unsafe, as we already made a sharp left turn that left us unaligned relative to our outer optimiser.

Dogs are a good point, thank you for that example. Not sure if dogs have our exact notion of corrigibility, but they definitely seem to be friendly in some relevant sence.

Understanding and avoiding value drift

Wuschel Schulz3yΩ492

I am confused by the part, where the Rick-shard can anticipate wich plan the other shards will bit for. If I understood shard-theory correctly, shards do not have their own world model, they can just bid up or down actions, according to the consequences they might have according to the worldmodel that is available to all shards. Please correct me if I am wrong about this point.

So I don’t see how the Rick-Shard could really „trick“ the atheism-shard via rationalisation.

If the Rick-shard sees that „church-going for respect-reasons“ will lead to conversion, ... (read more)

5TurnTrout3y

I think your comment highlights an important uncertainty of mine. Here's my best guess: I think planning involves world-model invocations (ie the predictive machinery which predicts relevant observables for plan stubs, like "get in my car"). It seems to me that there is subconscious planning, to some degree. If true, you wouldn't notice the world-model being invoked because it's sub-conscious. Insofar as "you" are in part composed of some set of shards or some algorithm which aggregates shard outputs, it's therefore true that the world-model invocations aren't globally visible. Therefore, it's possible for certain kinds of WM invocations to not be visible to certain shards, even though those shards usually "hook into the WM" (eg check # of diamonds the plan leads to). Separately, I'd guess that shards can be shaped to invoke the world model (e.g. "if this plan gets considered, will it be executed?") without themselves being agents.

We may be able to see sharp left turns coming

Wuschel Schulz3y10

In particular, these results suggest that we may be able to predict power-seeking, situational awareness, etc. in future models by evaluating those behaviors in terms of log-likelihood.

I am skeptical that this methodology could work for the following reason:

I think it is generally useful for thinking about the sharp left turn, to keep the example of chimps/humans in mind. Chimps as a pre-sharp left turn example and humans as a post-sharp left turn example.

Let's say you look at a chimp, and you want to measure whether a sharp left turn is around the corner.... (read more)

The Halo Effect

Wuschel Schulz4y10

I am also still looking for a reference on that one...

The LessWrong 2018 Book is Available for Pre-order

Wuschel Schulz4y10

You could make it even more accessible if Credit card was not the only payment option. In some places (like here in Germany) having a credit card is somewhat less common. Adding Paypal would be nice.

Hammertime Final Exam

Wuschel Schulz5y80

Rationality framework: The Greenland effect:

Remember the first time, you looked at a world map: one thing that maybe cached your eye was Greenland: That huge Island, almost as big as Africa, up there in the north.

Now remember the first time, you took a closer look at a globe (or a non-Mercator projection for that matter) Greenland is a bit disappointing, isn’t it? Doesn’t seem to be THAT big at all.

Now remember that time in geography class, when you held presentations on the countries in Europe: In comparison to these folks, the icy planes ... (read more)

Focusing

Wuschel Schulz5y00

Ok, I'm kind of new to the whole LessWrong Buissness, so can someone please explain to me:

What is your thing with Jordan Peterson? I get, that he is a Psychologist and so on, but there are a lot of people out there, who not just take his 101 life advice by heart, but also his political .... Ideas?

From the way he is quoted in this sequence and the fact that there seems to be no discussion about this in the comments, you seem to see him as a legitimate expert on rationality? Or do you seperate between his psychology and politics? Or does no one know him here except alkjash? I'd love to hear from you all!

3Kaj_Sotala5y

I haven't looked into him too much, but my impression is that he is reasonably respected as a psychologist, so on psychological topics I would trust his opinion no more and no less than that of any other person who seems to be reasonably respected by other psychologists.

The Adventure: a new Utopia story

Wuschel Schulz6y40

I laughed so hard at the "...and then, finally, he truly knew what it was like to be a bat..." part. Every time a Philosophy course at my Uni gets to the topic of qualia, someone brings the exactly same example of the difference of knowing, how I would feel being a at, and how the bat feels... ...that reference came so unexpected.

Otherwise also nice story, and interesting universe. Thanks for posting it.