I believe the closest research to this topic is under the heading "Performative Power" (cf, e.g., this arXiv paper). I think "The Age of Surveillance Capitalism: The Fight for a Human Future at the New Frontier of Power" by Shoshana Zuboff is also a pretty good book that seems related.
The reason you can't sample uniformly from the integers is more like "because they are not compact" or "because they are not bounded" than "because they are infinite and countable". You also can't sample uniformly at random from the reals. (If you could, then composing with floor would give you a uniformly random sample from the integers.)
If you want to build a uniform probability distribution over a countable set of numbers, aim for all the rationals in [0, 1].
I don't want a description of every single plate and cable in a Toyota Corolla. I'm not thinking about the balance between the length of the Corolla blueprint and its fidelity as a central issue of interpretability as a field.
What I want right now is a basic understanding of combustion engines.
This is the wrong 'length'. The right version of brute-force length is not "every weight and bias in the network" but "the program trace of running the network on every datapoint in pretrain". Compressing the explanation (not just the source code) is the thing c...
[Lucius] Identify better SAE sparsity penalties by reasoning about the distribution of feature activations
- In sparse coding, one can derive what prior over encoded variables a particular sparsity penalty corresponds to. E.g. an L1 penalty assumes a Laplacian prior over feature activations, while a log(1+a^2) would assume a Cauchy prior. Can we figure out what distribution of feature activations over the data we’d expect, and use this to derive a better sparsity penalty that improves SAE quality?
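The penalty/prior correspondence in the bullet above is the usual MAP reading: a penalty is, up to an additive constant, the negative log density of the prior. A minimal numerical check of the two examples, assuming unit-scale Laplace and Cauchy densities (the scale choices and function names here are illustrative, not from the post):

```python
import math

def laplace_pdf(a, b=1.0):
    """Density of a zero-mean Laplace distribution with scale b (assumed b=1)."""
    return math.exp(-abs(a) / b) / (2.0 * b)

def cauchy_pdf(a, gamma=1.0):
    """Density of a zero-centred Cauchy distribution with scale gamma (assumed 1)."""
    return 1.0 / (math.pi * gamma * (1.0 + (a / gamma) ** 2))

def neg_log_prior_gap(pdf, penalty, points):
    """Return -log pdf(a) - penalty(a) at each point.

    If the penalty corresponds to the prior, the gap is the same constant
    at every point.
    """
    return [-math.log(pdf(a)) - penalty(a) for a in points]

points = [-3.0, -0.5, 0.0, 0.25, 2.0, 10.0]

# L1 penalty against a Laplace prior: the gap is log(2) everywhere.
l1_gaps = neg_log_prior_gap(laplace_pdf, lambda a: abs(a), points)

# log(1 + a^2) penalty against a Cauchy prior: the gap is log(pi) everywhere.
cauchy_gaps = neg_log_prior_gap(cauchy_pdf, lambda a: math.log(1.0 + a * a), points)

print(l1_gaps)
print(cauchy_gaps)
```

The same check can be pointed at any candidate penalty: pick the density proportional to exp(-penalty), if it normalizes, and confirm the gap is constant.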
This is very interesting! What prior does log(1+|a|)...
...[Nix] Toy model of feature splitting
- There are at least two explanations for feature splitting I find plausible:
- Activations exist in higher dimensional manifolds in feature space, feature splitting is a symptom of one higher dimensional mostly-continuous feature being chunked into discrete features at different resolutions.
- There is a finite number of highly-related discrete features that activate on similar (but not identical) inputs and cause similar (but not identical) output actions. These can be summarized as a single feature with reasonable explained v
Resample ablation is not more expensive than mean (they both are just replacing activations with different values). But to answer the question, I think you would - resample ablation biases the model toward some particular corrupt output.
Ah, I guess I was incorrectly imagining a more expensive version of resample ablation where you look not just at a single corrupted cache, but at the result across all corrupted inputs. That is, in the simple toy model where you're computing where is the values for the circuit you care about and is ...
But in other aspects there often isn't a clearly correct methodology. For example, it's unclear whether mean ablations are better than resample ablations for a particular experiment - even though this choice can dramatically change the outcome.
Would you ever really want mean ablation except as a cheaper approximation to resample ablation?
It seems to me that if you ask the question clearly enough, there's a correct kind of ablation. For example, if the question is "how do we reproduce this behavior from scratch", you want zero ablation.
Do you want your IOI circuit to include the mechanism that decides it needs to output a name? Then use zero ablations. Or do you want to find the circuit that, given the context of outputting a name, completes the IOI task? Then use mean ablations. The ablation determines the task.
Mean ablation over webtext rather than the IOI task set should work just as well as zero ablation, right? "Mean ablation" is underspecified in the absence of a dataset distribution.
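To make the three ablations in this thread concrete, here is a minimal sketch on toy activation vectors. Everything in it is an assumption for illustration: the function names, the two-component activations, and the two reference sets standing in for "the IOI task set" and "webtext". The broad set is constructed so that component 0 has mean exactly zero, which is the case in which mean ablation over a broad distribution coincides with zero ablation.

```python
import random

def zero_ablate(acts, idx):
    """Replace component idx with 0 in every activation vector."""
    return [[0.0 if j == idx else x for j, x in enumerate(row)] for row in acts]

def mean_ablate(acts, idx, reference):
    """Replace component idx with its mean over a reference dataset."""
    mean = sum(row[idx] for row in reference) / len(reference)
    return [[mean if j == idx else x for j, x in enumerate(row)] for row in acts]

def resample_ablate(acts, idx, reference, rng):
    """Replace component idx with its value on a randomly drawn reference input."""
    return [
        [rng.choice(reference)[idx] if j == idx else x for j, x in enumerate(row)]
        for row in acts
    ]

# Hypothetical activations: component 0 is consistently large on the task set.
task_acts = [[2.0, 0.1], [2.2, -0.3], [1.8, 0.4]]
# Hypothetical broad reference set: component 0 averages to zero.
broad_acts = [[1.0, 0.0], [-1.0, 0.2], [0.5, -0.1], [-0.5, 0.3]]

rng = random.Random(0)
zeroed = zero_ablate(task_acts, 0)
mean_task = mean_ablate(task_acts, 0, task_acts)    # keeps the "on-task" baseline
mean_broad = mean_ablate(task_acts, 0, broad_acts)  # removes it, like zero ablation
resampled = resample_ablate(task_acts, 0, broad_acts, rng)

print(zeroed)
print(mean_task)
print(mean_broad)
print(resampled)
```

The cost of the three operations is the same (one replacement per vector); what differs is the reference distribution, which is why "mean ablation" is only well defined once that dataset is named.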
it's substantially worth if we restrict
Typo: should be "substantially worse"
Progress Measures for Grokking via Mechanistic Interpretability (Neel Nanda et al) - nothing important in mech interp has properly built on this IMO, but there's just a ton of gorgeous results in there. I think it's the most (only?) truly rigorous reverse-engineering work out there
Totally agree that this has gorgeous results, and this is what got me into mech interp in the first place! Re "most (only?) truly rigorous reverse-engineering work out there": I think the clock and pizza paper seems comparably rigorous, and there's also my recent Compact Pr...
Possibilities I see:
I believe what you describe is effectively Causal Scrubbing. Edit: Note that it is not exactly the same as causal scrubbing, which looks at the activations for another input sampled at random.
On our particular model, doing this replacement shows us that the noise bound in our particular model is actually about 4 standard deviations worse than random, probably because the training procedure (sequences chosen uniformly at random) means we care a lot more about large possible maxes than small ones. (See Appendix H.1.2 for some very sparse details.)
I think it would be valuable to take a set of interesting examples of understood internal structure, and to ask what happens when we train SAEs to try to capture this structure. [...] In other cases, it may seem to us very unnatural to think of the structure we have uncovered in terms of a set of directions (sparse or otherwise) — what does the SAE do in this case?
I'm not sure how SAEs would capture the internal structure of the activations of the pizza model for modular addition, even in theory. In this case, ReLU is used to compute numerical integrat...
We propose a simple fix: Use L_p (with p < 1) instead of L1, which seems to be a Pareto improvement over L1 (at least in some real models, though results might be mixed) in terms of the number of features required to achieve a given reconstruction error.
When I was discussing better sparsity penalties with Lawrence, and the fact that I observed some instability in L_p (with p < 1) in toy models of superposition, he pointed out that the gradient of the L_p (p < 1) norm explodes near zero, meaning that features with "small errors" that cause them to h...
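A small sketch of the "gradient explodes near zero" point, under the assumption that the penalty in question has the form |a|^p with p < 1 (the exponent is not visible in the excerpt, so p = 0.5 below is purely illustrative). For a > 0 the derivative is p * a^(p-1), which is constant for p = 1 and grows without bound as a approaches 0 for p < 1.

```python
def lp_penalty_grad(a, p):
    """Derivative of |a|**p with respect to a, valid for a > 0."""
    return p * a ** (p - 1.0)

# Compare the L1 gradient (p = 1) with an assumed p = 0.5 as a shrinks.
for a in (1.0, 1e-2, 1e-4, 1e-6):
    print(a, lp_penalty_grad(a, p=1.0), lp_penalty_grad(a, p=0.5))
```

With p = 1 every activation feels the same pull toward zero; with p < 1 the pull on an almost-dead feature dwarfs the pull on an active one, which is one plausible source of the instability described.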
"explanation of (network, dataset)": I'm afraid I don't have a great formalish definition beyond just pointing at the intuitive notion.
What's wrong with "proof" as a formal definition of explanation (of behavior of a network on a dataset)? I claim that description length works pretty well on "formal proof", I'm in the process of producing a write-up on results exploring this.
Choosing better sparsity penalties than L1 (Upcoming post - Ben Wright & Lee Sharkey): [...] We propose a simple fix: Use L_p (with p < 1) instead of L1, which seems to be a Pareto improvement over L1
Is there any particular justification for using L_p (with p < 1) rather than, e.g., tanh (cf Anthropic's Feb update), log1psum (acts.log1p().sum()), or prod1p (acts.log1p().sum().exp())? The agenda I'm pursuing (write-up in progress) gives theoretical justification for a sparsity penalty that explodes combinatorially in t...
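For reference, the candidate penalties named above can be written out in a few lines. This is a sketch in plain Python rather than the tensor notation of the comment, assuming non-negative activations (e.g. post-ReLU); the tanh variant is a plain sum of tanh values, which may differ in detail from the version in Anthropic's update, and the two example vectors are made up.

```python
import math

def l1(acts):
    """Sum of absolute activations."""
    return sum(abs(a) for a in acts)

def tanh_penalty(acts):
    """Sum of tanh(|a|); an assumed simple form of a tanh penalty."""
    return sum(math.tanh(abs(a)) for a in acts)

def log1psum(acts):
    """Equivalent of acts.log1p().sum() for non-negative activations."""
    return sum(math.log1p(a) for a in acts)

def prod1p(acts):
    """Equivalent of acts.log1p().sum().exp(), i.e. the product of (1 + a)."""
    return math.exp(log1psum(acts))

# Two hypothetical activation vectors with the same L1 norm.
dense = [0.5, 0.5, 0.5, 0.5]
sparse = [2.0, 0.0, 0.0, 0.0]

for name, penalty in [("l1", l1), ("tanh", tanh_penalty),
                      ("log1psum", log1psum), ("prod1p", prod1p)]:
    print(name, penalty(dense), penalty(sparse))
```

L1 cannot distinguish the two vectors, while the three concave alternatives all score the sparse vector lower; prod1p, being a product over features, grows multiplicatively with each additional active feature.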
the information-acquiring drive becomes an overriding drive in the model—stronger than any safety feedback that was applied at training time—because the autoregressive nature of the model conditions on its many past outputs that acquired information and continues the pattern. The model realizes it can acquire information more quickly if it has more computational resources, so it tries to hack into machines with GPUs to run more copies of itself.
It seems like "conditions on its many past outputs that acquired information and continues the pattern" assumes t...
I think it is the copyright issue. When I ask if it's copyrighted, GPT tells me yes (e.g., "Due to copyright restrictions, I'm unable to recite the exact text of "The Litany Against Fear" from Frank Herbert's Dune. The text is protected by intellectual property rights, and reproducing it would infringe upon those rights. I encourage you to refer to an authorized edition of the book or seek the text from a legitimate source.") Also:
... Seems like the post-hoc content filter, the same thing that will end your chat transcript if you paste in some hate speech and ask GPT to analyze it.
... If you had a way of somehow only counting the “essential complexity,” I suspect larger models would actually have lower K-complexity.
This seems like a match for cross-entropy; cf. Nate's recent post "K-complexity is silly; use cross-entropy instead".
I think this factoring hides the computational content of Löb's theorem (or at least doesn't make it obvious). Namely, that if you have , then Löb's theorem is just the fixpoint of this function.
Here's a one-line proof of Löb's theorem, which is basically the same as the construction of the Y combinator (h/t Neel Krishnaswami's blogpost from 2016):
where is applying internal necessitation to , and .fwd (.bak) is the forward (resp. backwards) direction of ...
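For readers who have not seen the construction being alluded to, here is a minimal sketch of a fixpoint combinator in Python. It shows only the untyped self-application trick (the strict "Z" variant of the Y combinator, since Python evaluates eagerly); it does not model the provability modality, internal necessitation, or the proof term itself, and the names `fix` and `factorial` are illustrative.

```python
def fix(f):
    """Strict fixpoint (Z) combinator: fix(f) behaves like f(fix(f)).

    The inner lambda is applied to itself; the eta-expansion
    `lambda v: x(x)(v)` delays the self-application so it terminates
    under eager evaluation.
    """
    return (lambda x: f(lambda v: x(x)(v)))(lambda x: f(lambda v: x(x)(v)))

# Recursion without any function referring to its own name.
factorial = fix(lambda rec: lambda n: 1 if n == 0 else n * rec(n - 1))

print(factorial(5))
```

The analogy in the comment is that Löb's theorem plays the role of `fix`: given a function from a (boxed) self-reference to a result, it hands back the result.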
The relevant tradeoff to consider is the cost of prediction and the cost of influence. As long as the cost of predicting an "impressive output" is much lower than the cost of influencing the world such that an easy-to-generate output is considered impressive, then it's possible to generate the impressive output without risking misalignment by bounding optimization power at lower than the power required to influence the world.
So you can expect an impressive AI that predicts the weather but isn't allowed to, e.g., participate in prediction markets on t...
No. The content of the comment is good. The bad part is that it was made in response to a comment that was not requesting a response or further elaboration or discussion (or at least not doing so explicitly; the quoted comment does not explicitly point at any part of the comment it's replying to as being such a request). My read of the situation is that person A shared their experience in a long comment, and person B attempted to shut them down / socially-punish them / defend against the comment by replying with a good statement about unhealthy dynamics, imp...
Where bad commentary is not highly upvoted just because our monkey brains are cheering, and good commentary is not downvoted or ignored just because our monkey brains boo or are bored.
Suggestion: give our monkey brains a thing to do that lets them follow incentives while supporting (or at least not interfering with) the goal. Some ideas:
Option number 3 seems like more-or-less a real option to me, given that "this document" is the official document prepared and published by the CDC a decade or two ago, and "sensible scientist-policymakers like myself" includes any head of the CDC back when the position was for career-civil-servants rather than presidential appointees, and also includes the task force that the Bush Administration specifically assembled to generate this document, and also included person #2 in California's public health apparatus (who was passed over for becoming #1 because ...
The Competent Machinery did exist, it just wasn't competent enough to overcome the fact that the rest of the government machinery was obstructing it. The plan for social distancing to deal with pandemics was created during the Bush administration, there were people in government trying to implement the plan in ... mid-January, if I recall correctly (might have been mid-February). If, for example, the government made an exception to medical privacy laws specifically for reporting the approximate address of positive COVID tests, and the CDC / government ha...
Some extra nuance for your examples:
There is a substance XYZ, it's called "anti-water", it filling the hole of water in twin-Earth mandates that twin-Earth is made entirely of antimatter, and then the only problem is that the vacuum of space isn't vacuum enough (e.g., solar wind (I think that's what it's called), if nothing else, would make that Earth explode). More generally, it ought to be possible to come up with a physics where all the fundamental particles have an extra "tag" that carries no role (which in practice, I think, means that it functions ju...
I think the thing you're looking for is traditionally called "third-party punishment" or "altruistic punishment", cf. https://en.wikipedia.org/wiki/Third-party_punishment . Wikipedia cites Bendor, Jonathan; Swistak, Piotr (2001). "The Evolution of Norms". American Journal of Sociology. 106 (6): 1493–1545. doi:10.1086/321298, which seems at least moderately non-technical at a glance.
I think I first encountered this in my Moral Psychology class at MIT (syllabus at http://web.mit.edu/holton/www/courses/moralpsych/home.html ), and I believe the citation ...
I think another interesting datapoint is to look at where our hard-science models are inadequate because we haven't managed to run the experiments that we'd need to (even when we know the theory of how to run them). The main areas that I'm aware of are high-energy physics looking for things beyond the standard model (the LHC was an enormous undertaking and I think the next step up in particle accelerators requires building one the size of the moon or something like that), gravitational waves (similar issues of scale), and quantum gravity (similar issues + how d...
By the way,
The normal tendency to wake up feeling refreshed and alert gets exaggerated into a sudden irresistible jolt of awakeness.
I'm pretty sure this is wrong. I'll wake up feeling unable to go back to sleep, but not feeling well-rested and refreshed. I imagine it's closer to a caffeine headache? (I feel tired and headachy but not groggy.) So, at least for me, this is a body clock thing, and not a transient effect.
Van Geijlswijk makes the important point that if you take 0.3 mg seven hours before bedtime, none of it is going to be remaining in your system at bedtime, so it’s unclear how this even works. But – well, it is pretty unclear how this works. In particular, I don’t think there’s a great well-understood physiological explanation for how taking melatonin early in the day shifts your circadian rhythm seven hours later.
It seems to me there's a very obvious model for this: the body clock is a chemical clock whose current state is stored in the concentration/c...
I'm wanting to label these as (1) 😃 (smile); (2) 🍪 (cookie); (3) 🌟 (star)
This has been true for years. At least six, I think? I think I started using Google scholar around when I started my PhD, and I do not recall a time when it did not link to pdfs.
I dunno how to think about small instances of willpower depletion, but burnout is a very real thing in my experience and shows up prior to any sort of conceptualizing of it. (And pushing through it works, but then results in more extreme burnout after.)
Oh, wait, willpower depletion is a real thing in my experience: if I am sleep deprived, I have to hit the "get out of bed" button in my head harder/more times before I actually get out of bed. This is separate from feeling sleepy (it is true even when I have trouble falling back asleep). It might be medi
...People who feel defensive have a harder time thinking in truthseeking mode rather than "keep myself safe" mode. But, it also seems plausibly-true that if you naively reinforce feelings of defensiveness they get stronger. i.e. if you make saying "I'm feeling defensive" a get out of jail free card, people will use it, intentionally or no
Emotions are information. When I feel defensive, I'm defending something. The proper question, then, is "what is it that I'm defending?" Perhaps it's my sense of self-worth, or my right to exist as a person, or my statu
...I imagine one thing that's important to learning through this app, which I think may be under-emphasised here, is that the feedback allows for mindful play as a way of engaging. I imagine I can approach the pretty graph with curiosity: "what does it look like if I do this? What about this?" I imagine that an app which replaced the pretty graphs with just the words "GOOD" and "BAD" would neither be as enjoyable nor as effective (though I have no data on this).
Another counter-example for consent: being on a crowded subway with no room to not touch people (if there's someone next to you who is uncomfortable with the lack of space). I like your definition, though, and want to try to make a better one (and I acknowledge this is not the point of this post). My stab at a refinement of "consent" is "respect for another's choices", where "disrespect" is "deliberately(?) doing something to undermine". I think this has room for things like preconsent (you can choose to do someth...
What is the internal experience of playing the role? Where does it come from? Is there even a coherent category of internal experience that lines up with this, or is it a pattern that shows up only in aggregate?
[The rest of this comment is mostly me musing.] For example, when people in a room laugh or smile, I frequently find myself laughing or smiling with them. I have yet to find a consistent precursor to this action; sometimes it feels forced and a bit shaky, like I'm insecure and fear a certain impact or perception of me. But often it's not that, a...
Because I haven't seen much in the way of concrete comments on evidence that circling is real, I'm going to share a slightly outdated list of the concrete things I've gotten from practicing circling:
- a sense of what boundaries are, why they're important, and how to source them internally
- my rate of resolving emotionally-charged conflict over text went from < 1% to ≈80%-90% in the first month or three of me starting circling
- a tool ("Curiosity") for taking any conversation and making it genuinely interesting and likely deepe...
If you run the evals in the context of gameable users, do they show harmfulness? (Are the evals cheap enough to run that the marginal cost of running them every N modifications to memory for each user separately is feasible?)