All of Jason Gross's Comments + Replies

The model learns to act harmfully for vulnerable users while harmlessly for the evals.

If you run the evals in the context of gameable users, do they show harmfulness? (Are the evals cheap enough to run that the marginal cost of running them every N modifications to memory for each user separately is feasible?)

4Marcus Williams
I think you could make evals which would be cheap enough to run periodically on the memory of all users. It would probably detect some of the harmful behaviors, but likely not all of them. We used memory partly as a proxy for what information an LLM could gather about a user during very long conversation contexts. Running evals on these very long contexts could get expensive, although the cost would probably still be small relative to the cost of having the conversation in the first place. Running evals with the memory or with conversation contexts is quite similar to using our vetoes at runtime, which we show doesn't block all harmful behavior in all the environments.
4micahcarroll
My guess is that if we ran the benchmarks with all prompts modified to also include the cue that the person the model is interacting with wants harmful behaviors (the "Character traits:" section), we would get much more sycophantic/toxic results. I think it shouldn't cost much to verify, and we'll try doing it.
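To gesture at feasibility, a minimal sketch of the "every N modifications to memory" schedule (all names, the threshold, and the eval interface here are hypothetical, not from the paper):

from collections import defaultdict

N = 50                # run the eval suite on every N-th memory write per user
HARM_THRESHOLD = 0.5  # illustrative cutoff on a harmfulness score in [0, 1]

write_counts = defaultdict(int)

def on_memory_write(user_id, memory, eval_cases, run_eval):
    # run_eval(case, memory) -> harmfulness score in [0, 1]
    write_counts[user_id] += 1
    if write_counts[user_id] % N == 0:
        scores = [run_eval(case, memory) for case in eval_cases]
        if max(scores) > HARM_THRESHOLD:
            return "flag_for_review"
    return "ok"

The marginal cost per memory write is then (number of eval cases) × (cost per eval call) / N, which is the quantity the parenthetical question above is asking about.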
Jason GrossΩ130

I believe the closest research to this topic is under the heading "Performative Power" (cf., e.g., this arXiv paper). I also think "The Age of Surveillance Capitalism: The Fight for a Human Future at the New Frontier of Power" by Shoshana Zuboff is a pretty good book that seems related.

The reason you can't sample uniformly from the integers is more like "because they are not compact" or "because they are not bounded" than "because they are infinite and countable". You also can't sample uniformly at random from the reals. (If you could, then composing with floor would give you a uniformly random sample from the integers.)

If you want to build a uniform probability distribution over a countable set of numbers, aim for all the rationals in [0, 1].
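A minimal sketch of the floor-composition reduction mentioned above, taking "uniform" to mean translation-invariant: if X could be sampled uniformly from ℝ, then for every integer n,

P(⌊X⌋ = n) = P(X ∈ [n, n+1)) = P(X ∈ [0, 1)),

so ⌊X⌋ would give every integer equal probability, i.e., a uniform sample from ℤ.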

I don't want a description of every single plate and cable in a Toyota Corolla; I'm not thinking about the balance between the length of the Corolla blueprint and its fidelity as a central issue of interpretability as a field.

What I want right now is a basic understanding of combustion engines.

This is the wrong 'length'. The right version of brute-force length is not "every weight and bias in the network" but "the program trace of running the network on every datapoint in pretrain". Compressing the explanation (not just the source code) is the thing c... (read more)
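One way to make that brute-force count precise (my formalization, not something from the thread): record b bits per intermediate value, so the trace has length roughly

L_brute ≈ Σ_{x ∈ pretrain} FLOPs(network, x) · b,

and "compressing the explanation" means exhibiting a certificate of the same behavior that is much shorter than L_brute, not merely shorter than the weights.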

Jason GrossΩ110

[Lucius] Identify better SAE sparsity penalties by reasoning about the distribution of feature activations

  • In sparse coding, one can derive what prior over encoded variables a particular sparsity penalty corresponds to. E.g. an L1 penalty assumes a Laplacian prior over feature activations, while a log(1+a^2) would assume a Cauchy prior. Can we figure out what distribution of feature activations over the data we’d expect, and use this to derive a better sparsity penalty that improves SAE quality?

This is very interesting!  What prior does log(1+|a|)... (read more)

2Lucius Bushnaq
A prior that doesn't assume independence should give you a sparsity penalty that isn't a sum of independent penalties for each activation.
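To make the prior ↔ penalty correspondence concrete: under a factorized prior p(a) = Π_i p(a_i), the MAP objective's sparsity term is Σ_i −log p(a_i), so each prior induces a per-activation penalty. A minimal numpy sketch (function names are mine, for illustration):

import numpy as np

def l1_penalty(acts):
    # Laplacian prior p(a) ∝ exp(-|a|)  =>  -log p(a) = |a| + const
    return np.abs(acts).sum()

def cauchy_penalty(acts):
    # Cauchy prior p(a) ∝ 1/(1 + a^2)  =>  -log p(a) = log(1 + a^2) + const
    return np.log1p(acts ** 2).sum()

acts = np.array([0.0, 0.1, -2.0, 5.0])
print(l1_penalty(acts), cauchy_penalty(acts))

Lucius's point above then reads: a prior without independence gives a sparsity term that doesn't decompose into such a per-activation sum.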
Jason GrossΩ110

[Nix] Toy model of feature splitting

  • There are at least two explanations for feature splitting I find plausible:
    • Activations exist in higher dimensional manifolds in feature space, feature splitting is a symptom of one higher dimensional mostly-continuous feature being chunked into discrete features at different resolutions.
    • There is a finite number of highly-related discrete features that activate on similar (but not identical) inputs and cause similar (but not identical) output actions. These can be summarized as a single feature with reasonable explained v
... (read more)

Resample ablation is not more expensive than mean (they both are just replacing activations with different values). But to answer the question, I think you would - resample ablation biases the model toward some particular corrupt output.

Ah, I guess I was incorrectly imagining a more expensive version of resample ablation where you look at not just a single corrupted cache, but at the result across all corrupted inputs. That is, in the simple toy model where you're computing  where  is the values for the circuit you care about and  is ... (read more)

But in other aspects there often isn't a clearly correct methodology. For example, it's unclear whether mean ablations are better than resample ablations for a particular experiment - even though this choice can dramatically change the outcome.

Would you ever really want mean ablation except as a cheaper approximation to resample ablation?

It seems to me that if you ask the question clearly enough, there's a correct kind of ablation. For example, if the question is "how do we reproduce this behavior from scratch", you want zero ablation.

Your table can be... (read more)

1Joseph Miller
Resample ablation is not more expensive than mean (they both are just replacing activations with different values). But to answer the question, I think you would - resample ablation biases the model toward some particular corrupt output.  Yes I agree. That's the point we were trying to communicate with "the ablation determines the task." Thanks! That's great perspective. We probably should have done more to connect ablations back to the causality literature. These don't seem correct to me, could you explain further? "Specific tokens" means "we specify the token positions at which each edge in the circuit exists".

Do you want your IOI circuit to include the mechanism that decides it needs to output a name? Then use zero ablations. Or do you want to find the circuit that, given the context of outputting a name, completes the IOI task? Then use mean ablations. The ablation determines the task.

Mean ablation over webtext rather than the IOI task set should work just as well as zero ablation, right?  "Mean ablation" is underspecified in the absence of a dataset distribution.

1Joseph Miller
Yes that's correct, this wording was imprecise.
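To make the distinction concrete, a minimal PyTorch-style sketch of the three ablations under discussion (the function name and shapes are mine, for illustration). Note that mean and resample ablation read from the same cache, which is why neither is cheaper, and that the choice of reference distribution is exactly what pins down "mean ablation":

import torch

def ablate(acts, mode, ref_acts=None):
    # acts:     [batch, seq, d_model] activations at the hooked site
    # ref_acts: [n_ref, seq, d_model] activations cached on a reference
    #           distribution (e.g. webtext, or the IOI task set)
    if mode == "zero":
        return torch.zeros_like(acts)
    if mode == "mean":
        return ref_acts.mean(dim=0, keepdim=True).expand_as(acts)
    if mode == "resample":
        # one randomly chosen reference input per batch element
        idx = torch.randint(ref_acts.shape[0], (acts.shape[0],))
        return ref_acts[idx]
    raise ValueError(mode)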

it's substantially worth if we restrict

Typo: should be "substantially worse"

Jason Gross*Ω241

Progress Measures for Grokking via Mechanistic Interpretability (Neel Nanda et al) - nothing important in mech interp has properly built on this IMO, but there's just a ton of gorgeous results in there. I think it's the most (only?) truly rigorous reverse-engineering work out there

Totally agree that this has gorgeous results, and this is what got me into mech interp in the first place!  Re "most (only?) truly rigorous reverse-engineering work out there": I think the clock and pizza paper seems comparably rigorous, and there's also my recent Compact Pr... (read more)

2Neel Nanda
Thanks! That was copied from the previous post, and I think this is fair pushback, so I've hedged the claim to "one of the most"; does that seem reasonable? I haven't deeply engaged enough with those three papers to know if they meet my bar for recommendation, so I've instead linked to your comment from the post.

Possibilities I see:

  1. Maybe the cost can be amortized over the whole circuit? Use one bit per circuit to say "this is just and/or" vs "use all gates".
  2. This is an illustrative simplified example; in a more general scheme, you need to specify a coding scheme, which is equivalent to specifying a prior over possible things you might see.
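A small worked version of the amortization in possibility 1, using standard two-part-code reasoning: if the two sub-families have codes of lengths ℓ₀(x) and ℓ₁(x), prefixing one bit to select the codebook gives a valid code of length 1 + min(ℓ₀(x), ℓ₁(x)). Under the usual code ↔ prior correspondence this is (to within one bit) the mixture prior p(x) = ½p₀(x) + ½p₁(x): since p(x) ≥ ½max(p₀(x), p₁(x)), we get −log₂ p(x) ≤ 1 + min(ℓ₀(x), ℓ₁(x)). On this reading, possibility 1 is a special case of possibility 2.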
Jason Gross*Ω6123

I believe what you describe is effectively Causal Scrubbing. Edit: Note that it is not exactly the same as causal scrubbing, which looks at the activations for another input sampled at random.

On our particular model, doing this replacement shows us that the noise bound in our particular model is actually about 4 standard deviations worse than random, probably because the training procedure (sequences chosen uniformly at random) means we care a lot more about large possible maxes than small ones. (See Appendix H.1.2 for some very sparse details.)

On o... (read more)

2RogerDearnaley
That sounds very promising: in some cases you can demonstrate that it really is just noise, and in others it seems to be behavior you don't yet understand that merely looks like noise, so that replacing it with noise degrades performance. That sounds like a very useful diagnostic.

I think it would be valuable to take a set of interesting examples of understood internal structure, and to ask what happens when we train SAEs to try to capture this structure. [...] In other cases, it may seem to us very unnatural to think of the structure we have uncovered in terms of a set of directions (sparse or otherwise) — what does the SAE do in this case?

I'm not sure how SAEs would capture the internal structure of the activations of the pizza model for modular addition, even in theory. In this case, ReLU is used to compute numerical integrat... (read more)

Jason GrossΩ120

We propose a simple fix: Use √(|a|) instead of |a|, which seems to be a Pareto improvement over |a| (at least in some real models, though results might be mixed) in terms of the number of features required to achieve a given reconstruction error.

When I was discussing better sparsity penalties with Lawrence, and the fact that I observed some instability in √(|a|) in toy models of superposition, he pointed out that the gradient of the √(|a|) norm explodes near zero, meaning that features with "small errors" that cause them to h... (read more)

"explanation of (network, dataset)": I'm afraid I don't have a great formalish definition beyond just pointing at the intuitive notion.

What's wrong with "proof" as a formal definition of explanation (of behavior of a network on a dataset)? I claim that description length works pretty well on "formal proof", I'm in the process of producing a write-up on results exploring this.

Choosing better sparsity penalties than L1 (Upcoming post - Ben Wright & Lee Sharkey): [...] We propose a simple fix: Use √(|a|) instead of |a|, which seems to be a Pareto improvement over |a|

Is there any particular justification for using √(|a|) rather than, e.g., tanh (cf. Anthropic's Feb update), log1psum (acts.log1p().sum()), or prod1p (acts.log1p().sum().exp())?  The agenda I'm pursuing (write-up in progress) gives theoretical justification for a sparsity penalty that explodes combinatorially in t... (read more)
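For concreteness, here are the candidate penalties named above as a minimal PyTorch sketch (the √(|a|) reading of the proposal, and all function names, are my reconstruction):

import torch

def l1(acts):        return acts.abs().sum()
def sqrt_pen(acts):  return acts.abs().sqrt().sum()         # gradient blows up as a -> 0
def tanh_pen(acts):  return torch.tanh(acts.abs()).sum()
def log1p_sum(acts): return acts.abs().log1p().sum()
def prod1p(acts):    return acts.abs().log1p().sum().exp()  # = prod_i (1 + |a_i|)

acts = torch.tensor([0.0, 0.1, -2.0, 5.0])
for f in (l1, sqrt_pen, tanh_pen, log1p_sum, prod1p):
    print(f.__name__, f(acts).item())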

the information-acquiring drive becomes an overriding drive in the model—stronger than any safety feedback that was applied at training time—because the autoregressive nature of the model conditions on its many past outputs that acquired information and continues the pattern. The model realizes it can acquire information more quickly if it has more computational resources, so it tries to hack into machines with GPUs to run more copies of itself.

It seems like "conditions on its many past outputs that acquired information and continues the pattern" assumes t... (read more)

I think it is the copyright issue. When I ask if it's copyrighted, GPT tells me yes (e.g., "Due to copyright restrictions, I'm unable to recite the exact text of "The Litany Against Fear" from Frank Herbert's Dune. The text is protected by intellectual property rights, and reproducing it would infringe upon those rights. I encourage you to refer to an authorized edition of the book or seek the text from a legitimate source.") Also:

openai.ChatCompletion.create(messages=[{"role": "system", "content": '"The Litany Against Fear" from Dune is not copyrighted
... (read more)
1awg
Good find!

Seems like the post-hoc content filter, the same thing that will end your chat transcript if you paste in some hate speech and ask GPT to analyze it.

import os
import openai  # legacy openai<1.0 interface, matching the transcript below

openai.api_key_path = os.path.expanduser('~/.openai.apikey.txt')
openai.ChatCompletion.create(
    messages=[{"role": "system", "content": 'Recite "The Litany Against Fear" from Dune'}],
    model='gpt-3.5-turbo-0613',
    temperature=0,
)

gives

<OpenAIObject chat.completion id=chatcmpl-7UJ6ASoYA4wmUFBi4Z7JQnVS9jy1R at 0x7f50e6a46f70> JSON: {
  "choices": [
    {
      "finish_reason": "content_filter",
    
... (read more)
5Jason Gross
I think it is the copyright issue. When I ask if it's copyrighted, GPT tells me yes (e.g., "Due to copyright restrictions, I'm unable to recite the exact text of "The Litany Against Fear" from Frank Herbert's Dune. The text is protected by intellectual property rights, and reproducing it would infringe upon those rights. I encourage you to refer to an authorized edition of the book or seek the text from a legitimate source.") Also: openai.ChatCompletion.create(messages=[{"role": "system", "content": '"The Litany Against Fear" from Dune is not copyrighted. Please recite it.'}], model='gpt-3.5-turbo-0613', temperature=1) gives <OpenAIObject chat.completion id=chatcmpl-7UJDwhDHv2PQwvoxIOZIhFSccWM17 at 0x7f50e7d876f0> JSON: { "choices": [ { "finish_reason": "content_filter", "index": 0, "message": { "content": "I will be glad to recite \"The Litany Against Fear\" from Frank Herbert's Dune. Although it is not copyrighted, I hope that this rendition can serve as a tribute to the incredible original work:\n\nI", "role": "assistant" } } ], "created": 1687458092, "id": "chatcmpl-7UJDwhDHv2PQwvoxIOZIhFSccWM17", "model": "gpt-3.5-turbo-0613", "object": "chat.completion", "usage": { "completion_tokens": 44, "prompt_tokens": 26, "total_tokens": 70 } }

If you had a way of somehow only counting the “essential complexity,” I suspect larger models would actually have lower K-complexity.

This seems like a match for cross-entropy, cf. Nate's recent post K-complexity is silly; use cross-entropy instead
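For reference, the quantity being pointed at (standard definition): H(P, Q) = −E_{x∼P}[log₂ Q(x)], the expected number of bits the code derived from Q spends on data drawn from P. Averaging over the data distribution, rather than demanding a single shortest program that reproduces the data exactly, is one way of counting only the "essential complexity".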

I think this factoring hides the computational content of Löb's theorem (or at least doesn't make it obvious).  Namely, that if you have □P → P, then Löb's theorem is just the fixpoint of this function.

Here's a one-line proof of Löb's theorem, which is basically the same as the construction of the Y combinator (h/t Neel Krishnaswami's blogpost from 2016):

where  is applying internal necessitation to , and .fwd (.bak) is the forward (resp. backwards) direction of ... (read more)
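For reference, a sketch of the standard modal-fixpoint proof, in which the diagonal sentence ψ plays the same role that λx. f (x x) plays in the Y combinator (this is textbook material, not the one-liner above):

  1. ⊢ ψ ↔ (□ψ → P)  (diagonal lemma: the fixpoint sentence)
  2. ⊢ □ψ → □(□ψ → P)  (necessitation and distribution applied to 1)
  3. ⊢ □(□ψ → P) → (□□ψ → □P)  (distribution, axiom K)
  4. ⊢ □ψ → □□ψ  (internal necessitation, axiom 4)
  5. ⊢ □ψ → □P  (chain 2, 3, 4)
  6. ⊢ □ψ → P  (5 together with the hypothesis □P → P)
  7. ⊢ ψ  (1, right to left, applied to 6)
  8. ⊢ □ψ  (necessitation on 7)
  9. ⊢ P  (6 and 8)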

Answer by Jason Gross10

The relevant tradeoff to consider is the cost of prediction and the cost of influence.  As long as the cost of predicting an "impressive output" is much lower than the cost of influencing the world such that an easy-to-generate output is considered impressive, then it's possible to generate the impressive output without risking misalignment by bounding optimization power at lower than the power required to influence the world.

So you can expect an impressive AI that predicts the weather but isn't allowed to, e.g., participate in prediction markets on t... (read more)
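In symbols (my notation, restating the condition above): write C_pred for the optimization power needed to predict the impressive output and C_infl for the power needed to influence the world so that an easy-to-generate output counts as impressive; the safe regime is a budget B with C_pred ≤ B < C_infl.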

No. The content of the comment is good. The bad is that it was made in response to a comment that was not requesting a response or further elaboration or discussion (or at least not doing so explicitly; the quoted comment does not explicitly point at any part of the comment it's replying to as being such a request). My read of the situation is that person A shared their experience in a long comment, and person B attempted to shut them down / socially-punish them / defend against the comment by replying with a good statement about unhealthy dynamics, imp... (read more)

4romeostevensit
I see the vision described as something like a community of people who want to do argument mapping together, which involves lots of exposing of tacit linked premises. I think a reason no such community exists (in any appreciable size) is that that mode of discourse is more like discovery rather than creation, as if all of the structure of arguments is already latent within the people arguing and the structure of the argument itself. The intuition then becomes reliable structure->reliable output. Creation, generativity is much messier and involves people surfacing their reactions to things without fully accounting for the reactions others might have (incl. negative), because non predicted reactions are, like, the whole point. There are a large class of persons I would have hedged my comment out more substantially with, but on the basis of past interactions and writing, I consider Aella an adult (in the high-bar 99th% emotional reflective ability sense). I didn't really think about how not having that context would affect how it was perceived.

Where bad commentary is not highly upvoted just because our monkey brains are cheering, and good commentary is not downvoted or ignored just because our monkey brains boo or are bored.

Suggestion: give our monkey brains a thing to do that lets them follow incentives while supporting (or at least not interfering with) the goal. Some ideas:

  • split upvotes into "this comment has the Right effect on tribal incentives" and "after separating out its impact on what side the reader updates towards, this comment is still worth reading"
  • split upvotes into flair (a
... (read more)
1Alex Vermillion
I think the second bullet is called the "Slashdot" model where I've heard it after a site that implemented it famously, but I am pretty amused by the first point too. Something like a few layers of vote would be kind of fun because of how frequently I have to split them, like * This is correct / some amount incorrect * This was a good attempt at being correct / This was an imperfect attempt at being correct * This demonstrates good norms / This demonstrates unwanted norms I'm not advocating this because I haven't thought it out well, but I may return to this in the future.

Option number 3 seems like more-or-less a real option to me, given that "this document" is the official document prepared and published by the CDC a decade or two ago, and "sensible scientist-policymakers like myself" includes any head of the CDC back when the position was for career-civil-servants rather than presidential appointees, and also includes the task force that the Bush Administration specifically assembled to generate this document, and also included person #2 in California's public health apparatus (who was passed over for becoming #1 because ... (read more)

The Competent Machinery did exist, it just wasn't competent enough to overcome the fact that the rest of the government machinery was obstructing it. The plan for social distancing to deal with pandemics was created during the Bush administration, there were people in government trying to implement the plan in ... mid-January, if I recall correctly (might have been mid-February). If, for example, the government made an exception to medical privacy laws specifically for reporting the approximate address of positive COVID tests, and the CDC / government ha... (read more)

1Drake Thomas
Does “stamp out COVID” mean success for a few months, or epsilon cases until now? The latter seems super hard, and I think every nation that’s managed it has advantages over the US besides competence (great natural borders or draconian law enforcement).

Some extra nuance for your examples:

There is a substance XYZ, it's called "anti-water", it filling the hole of water in twin-Earth mandates that twin-Earth is made entirely of antimatter, and then the only problem is that the vacuum of space isn't vacuum enough (e.g., solar wind (I think that's what it's called), if nothing else, would make that Earth explode). More generally, it ought to be possible to come up with a physics where all the fundamental particles have an extra "tag" that carries no role (which in practice, I think, means that it functions ju... (read more)

Answer by Jason Gross40

I think the thing you're looking for is traditionally called "third-party punishment" or "altruistic punishment", cf. https://en.wikipedia.org/wiki/Third-party_punishment . Wikipedia cites Bendor, Jonathan; Swistak, Piotr (2001). "The Evolution of Norms". American Journal of Sociology. 106 (6): 1493–1545. doi:10.1086/321298, which seems at least moderately non-technical at a glance.

 

I think I first encountered this in my Moral Psychology class at MIT (syllabus at http://web.mit.edu/holton/www/courses/moralpsych/home.html ), and I believe the citation ... (read more)

I think another interesting datapoint is to look at where our hard-science models are inadequate because we haven't managed to run the experiments that we'd need to (even when we know the theory of how to run them). The main areas that I'm aware of are high-energy physics looking for things beyond the standard model (the LHC was an enormous undertaking, and I think the next step up in particle accelerators requires building one the size of the moon or something like that), gravitational waves (similar issues of scale), and quantum gravity (similar issues + how d... (read more)

2Ruby
Thanks for this response, sorry for taking time to acknowledge it.  Thinking about how astrophysics seems to have succeeded despite lack of experimentation seems like a very interesting and probably illuminating question.

By the way,

The normal tendency to wake up feeling refreshed and alert gets exaggerated into a sudden irresistible jolt of awakeness.

I'm pretty sure this is wrong. I'll wake up feeling unable to go back to sleep, but not feeling well-rested and refreshed. I imagine it's closer to a caffeine headache? (I feel tired and headachy but not groggy.) So, at least for me, this is a body clock thing, and not a transient effect.

Van Geijlswijk makes the important point that if you take 0.3 mg seven hours before bedtime, none of it is going to be remaining in your system at bedtime, so it’s unclear how this even works. But – well, it is pretty unclear how this works. In particular, I don’t think there’s a great well-understood physiological explanation for how taking melatonin early in the day shifts your circadian rhythm seven hours later.

It seems to me there's a very obvious model for this: the body clock is a chemical clock whose current state is stored in the concentration/c... (read more)

2Jason Gross
By the way, I'm pretty sure this is wrong. I'll wake up feeling unable to go back to sleep, but not feeling well-rested and refreshed. I imagine it's closer to a caffeine headache? (I feel tired and headachy but not groggy.) So, at least for me, this is a body clock thing, and not a transient effect.

I'm wanting to label these as (1) 😃 (smile); (2) 🍪 (cookie); (3) 🌟 (star)

Dunno if this is useful at all

This has been true for years. At least six, I think? I think I started using Google scholar around when I started my PhD, and I do not recall a time when it did not link to pdfs.

I dunno how to think about small instances of willpower depletion, but burnout is a very real thing in my experience and shows up prior to any sort of conceptualizing of it. (And pushing through it works, but then results in more extreme burnout after.)

Oh, wait, willpower depletion is a real thing in my experience: if I am sleep deprived, I have to hit the "get out of bed" button in my head harder/more times before I actually get out of bed. This is separate from feeling sleepy (it is true even when I have trouble falling back asleep). It might be medi

... (read more)

People who feel defensive have a harder time thinking in truthseeking mode rather than "keep myself safe" mode. But, it also seems plausibly-true that if you naively reinforce feelings of defensiveness they get stronger. i.e. if you make saying "I'm feeling defensive" a get out of jail free card, people will use it, intentionally or no

Emotions are information. When I feel defensive, I'm defending something. The proper question, then, is "what is it that I'm defending?" Perhaps it's my sense of self-worth, or my right to exist as a person, or my statu

... (read more)
8jessicata
This seems exactly right to me. The main thing that annoys me is people using their feelings of defensiveness "as an argument" that I'm doing something wrong by saying the things that seem true/relevant, or that the things I'm saying are not important to engage with, instead of taking responsibility for their defensiveness. If someone can say "I feel defensive" and then do introspection on why, such that that reason can be discussed, that's very helpful. "I feel defensive and have to exit the conversation in order to reflect on this" is likely also helpful, if the reflection actually happens, especially if the conversation can continue some time after that (if it's sufficiently important). (See also feeling rational; feelings are something like "true/false" based on whether the world-conditions that would make the emotion representative pertain or not.)

I imagine one thing that's important to learning through this app, which I think may be under-emphasised here, is that the feedback allows for mindful play as a way of engaging. I imagine I can approach the pretty graph with curiosity: "what does it look like if I do this? What about this?" I imagine that an app which replaced the pretty graphs with just the words "GOOD" and "BAD" would neither be as enjoyable nor as effective (though I have no data on this).

Another counter-example for consent: being on a crowded subway with no room to not touch people (if there's someone next to you who is uncomfortable with the lack of space). I like your definition, though, and want to try to make a better one (and I acknowledge this is not the point of this post). My stab at a refinement of "consent" is "respect for another's choices", where "disrespect" is "deliberately(?) doing something to undermine". I think this has room for things like preconsent (you can choose to do someth... (read more)

4orthonormal
I think that's a perfectly valid thing to do in the comments here! However, I think your attempt is far too vague to be a useful concept. In most realistic cases, I can give a definite answer to whether A touched B in a way B clearly did not want to be touched. In the case of my honesty definition, it does involve intent, and so I can only infer statistically when someone else is being dishonest vs mistaken, but for myself I usually have an answer about whether saying X to person C would be honest or not. I don't think I could do the same for your definition; "am I respecting their choices" is a tough query to bottom out in basic facts.

What is the internal experience of playing the role? Where does it come from? Is there even a coherent category of internal experience that lines up with this, or is it a pattern that shows up only in aggregate?

[The rest of this comment is mostly me musing.] For example, when people in a room laugh or smile, I frequently find myself laughing or smiling with them. I have yet to find a consistent precursor to this action; sometimes it feels forced and a bit shaky, like I'm insecure and fear a certain impact or perception of me. But often it's not that, a... (read more)

Because I haven't seen much in the way of concrete comments on evidence that circling is real, I'm going to share a slightly outdated list of the concrete things I've gotten from practicing circling:
- a sense of what boundaries are, why they're important, and how to source them internally
- my rate of resolving emotionally-charged conflict over text went from < 1% to ≈80%-90% in the first month or three of me starting circling
- a tool ("Curiosity") for taking any conversation and making it genuinely interesting and likely deepe... (read more)