LESSWRONG

All of Lucius Bushnaq's Comments + Replies

RA x ControlAI video: What if AI just keeps getting smarter?

I agree it’s not a valid argument. I’m not sure about ‘dishonest’ though. They could just be genuinely confused about this. I was surprised how many people in machine learning seem to think the universal approximation theorem explains why deep learning works.

4Jeremy Gillen4d

Good point, I shouldn't have said dishonest. For some reason while writing the comment I was thinking of it as deliberately throwing vaguely related math at the viewer and trusting that they won't understand it. But yeah likely it's just a misunderstanding.

Alexander Gietelink Oldenziel's Shortform

Lucius Bushnaq12d128

Anecdotally, the effect of LLMs on my workflow hasn't been very large.

7β-redex12d

Just to add another data point, as a software engineer, I also find it hard to extract utility from LLMs. (And this has not been for a lack of trying, e.g. at work we are being pushed to use LLM enabled IDEs.) I am constantly surprised to hear when people on the internet say that LLMs are a significant productivity boost for them. My current model is that LLMs are better if you are working on some mainstream problem domain using a mainstream tech stack (language, library, etc.). This is approximately JavaScript React frontend development in my mind, and as you move away from that the less useful LLMs get. (The things I usually work on are using a non-mainstream tech stack and/or have a non-mainstream problem domain (but in my mind all interesting problems are non-mainstream in that sense), so this would explain my lack of success.)

MichaelDickens's Shortform

Lucius Bushnaq14d288

At a moderate P(doom), say under 25%, from a selfish perspective it makes sense to accelerate AI if it increases the chance that you get to live forever, even if it increases your risk of dying. I have heard from some people that this is their motivation.

If this is you: Please just sign up for cryonics. It's a much better immortality gambit than rushing for ASI.

7J Bostock13d

This seems not to be true assuming a P(doom) of 25% and a purely selfish perspective, or even a moderately altruistic perspective which places most of its weight on, say, the person's immediate family and friends. Of course any cryonics-free strategy is probably dominated by that same strategy plus cryonics for a personal bet at immortality, but when it comes to friends and family it's not easy to convince people to sign up for cryonics! But immortality-maxxing for one's friends and family almost definitely entails accelerating AI even at pretty high P(doom). (And that's without saying that this is very likely to not be the true reason for these people's actions. It's far more likely to be local-perceived-status-gradient-climbing followed by a post-hoc rationalization (which can also be understood as a form of local-perceived-status-gradient-climbing) and signing up for cryonics doesn't really get you any status outside of the deepest depths of the rat-sphere, which people like this are obviously not in since they're gaining status from accelerating AI.)

AE Studio is hiring!

Lucius Bushnaq16d1311

I like AE Studio. They seem to genuinely care about AI not killing everyone, and have been willing to actually back original research ideas that don't fit into existing paradigms.

Side note:

Previous posts have been met with great reception by the likes of Eliezer Yudkowsky and Emmett Shear, so we’re up to something good.

This might be a joke, but just in case it's not: I don't think you should reason about your own alignment research agenda like this. I think Eliezer would probably be the first person to tell you that.

2AE Studio15d

Thanks Lucius, yes, this was tongue-in-cheek and we actually decided to remove it shortly thereafter once we realized it might not come across in the right way. Totally grant the point, and thanks for calling it out.

Lucius Bushnaq's Shortform

Lucius Bushnaq19d*20

But they'd be too unchanged: the "afraid of mice" circuit would still be checking for "grey and big and mammal and ..." as the finetune dataset included no facts about animal fears. While some newer circuits formed during fine tuning would be checking for "grey and big and mammal and ... and high-scrabble-scoring". Any interpretability tool that told you that "grey and big and mammal and ..." was "elephant" in the first model is now going to have difficulty representing the situation.

Thank you, this is a good example of a type-of-thing to watch out for in ...

Lucius Bushnaq's Shortform

Lucius Bushnaq19d*20

I'm with @chanind: If elephant is fully represented by a sum of its attributes, then it's quite reasonable to say that the model has no fundamental notion of an elephant in that representation.
...

This is not a load-bearing detail of the example. If you like, you can instead imagine a model that embeds 1000 animals in an e.g. 50-dimensional subspace, with a 50 dimensional sub-sub-space where the embedding directions correspond to 50 attributes, and a 50 dimensional sub-sub-space where embeddings are just random.

This should still get you basically the ...

quetzal_rainbow's Shortform

Lucius Bushnaq20d141

The kind of 'alignment technique' that successfully points a dumb model in the rough direction of doing the task you want in early training does not necessarily straightforwardly connect to the kind of 'alignment technique' that will keep a model pointed quite precisely in the direction you want after it gets smart and self-reflective.

For a maybe not-so-great example, human RL reward signals in the brain used to successfully train and aim human cognition from infancy to point at reproductive fitness. Before the distributional shift, our brains usually neit...

2quetzal_rainbow18d

I mostly think about alignment methods like "model-based RL which maximizes reward iff it outputs action which is provably good under our specification of good".

Lucius Bushnaq's Shortform

Lucius Bushnaq20d20

How much money would you guess was lost on this?

34gravitons18d

I think you'd be hard-pressed to get a scientist to admit that the money was lost. ;) Honestly, it's not obvious that it would have been possible to do Advanced LIGO without the experience from the initial run, which is kind of the point I was making: we don't usually have tasks that humanity needs to get right on the first try, to the contrary humanity usually needs to fail a few times first! But the initial budget was around $400 million, the upgrade took another $200 million. I don't know how much was spent operating the experiment in its initial run, which I guess would be the cleanest proxy for money "wasted", if you're imagining a counterfactual where they got it right on the first try.

Lucius Bushnaq's Shortform

Lucius Bushnaq20d20

Yes.

Lucius Bushnaq's Shortform

Lucius Bushnaq21d*20

Technically you didn't specify that $c(x)$ can't be an arbitrary function, so you'd be able to reconstruct activations combining different bases, but it'd be horribly convoluted in practice.

I wouldn't even be too fussed about 'horribly convoluted' here. I'm saying it's worse than that. We would still have a problem even if we allowed ourselves arbitrary encoder functions to define the activations in the dictionary and magically knew which ones to pick.

The problem here isn't that we can't make a dictionary that includes all the $1050$ featur...

Lucius Bushnaq's Shortform

Lucius Bushnaq21d20

E.g. it's not possible to represent an elephant with any arbitrary combination of attributes, as the attributes themselves are what defines the elephant direction.

You can't represent elephants along with arbitrary combinations of attributes. You can't do that in a scheme where feature directions are fully random with no geometry either though. There, only a small number of features can have non-zero values at the same time, so you still only get $O(\sqrt{50})$ non-zero attribute features at once maximum.[1]

We would want the dictionary to learn the attrib...

4Adam Newgas20d

I'm with @chanind: If elephant is fully represented by a sum of its attributes, then it's quite reasonable to say that the model has no fundamental notion of an elephant in that representation. Yes, the combination "grey + big + mammal + ..." is special in some sense. If the model needed to recall that elephants are afraid of mice, the circuit would appear to check "grey and big and mammal" and that's an annoying mouthful that would be repeated all over the model. But it's a faithful representation of what's going on. Let me be precise about what I mean by "has no fundamental notion of an elephant". Suppose I tried to fine tune the model to represent some new fact about animals, say, if they are worth a lot of points in Scrabble. One way the model could do this is by squeezing another feature into the activation space. The other features might rotate a little during this training, but all the existing circuitry would basically continue functioning unchanged. But they'd be too unchanged: the "afraid of mice" circuit would still be checking for "grey and big and mammal and ..." as the finetune dataset included no facts about animal fears. While some newer circuits formed during fine tuning would be checking for "grey and big and mammal and ... and high-scrabble-scoring". Any interpretability tool that told you that "grey and big and mammal and ..." was "elephant" in the first model is now going to have difficulty representing the situation. Meanwhile, consider a "normal" model that has a residual notion of an elephant after you take away all facts about elephants. Then both old and new circuits would contain references to that residual (plus other junk) and one could meaningfully say both circuits have something in common. ---------------------------------------- Your example, which represents animals purely by their properties, reminds me of this classic article, which argues that a key feature in thought is forming concepts of things that are independent of the prop

johnswentworth's Shortform

Lucius Bushnaq22d44

Similarly, for people wanting to argue from the other direction, who might think a low current valuation is case-closed evidence against their success chances

To be clear: I think the investors would be wrong to think that AGI/ASI soon-ish isn't pretty likely.

johnswentworth's Shortform

Lucius Bushnaq22d*94

OpenAI's valuation is very much reliant on being on a path to AGI in the not-too-distant future.

Really? I'm mostly ignorant on such matters, but I'd thought that their valuation seemed comically low compared to what I'd expect if their investors thought that OpenAI was likely to create anything close to a general superhuman AI system in the near future.[1] I considered this evidence that they think all the AGI/ASI talk is just marketing.

[1] Well ok, if they actually thought OpenAI would create superintelligence as I think of it, their valuation would plu...

6Veedrac22d

Consider, in support: Netflix has a $418B market cap. It is inconsistent to think that a $300B valuation for OpenAI or whatever's in the news requires replacing tens of trillions of dollars of capital before the end of the decade. Similarly, for people wanting to argue from the other direction, who might think a low current valuation is case-closed evidence against their success chances, consider that just a year ago the same argument would have discredited how they are valued today, and a year before that would have discredited where they were a year ago, and so forth. This holds similarly for historic busts in other companies. Investor sentiment is informational but clearly isn't definitive, else stocks would never change rapidly.

Lucius Bushnaq's Shortform

Lucius Bushnaq22d20

If I understand correctly, it sounds like you're saying there is a "label" direction for each animal that's separate from each of the attributes.

No, the animal vectors are all fully spanned by the fifty attribute features.

I'm confused why a dictionary that consists of a feature direction for each attribute and each animal label can't explain these activations? These activations are just a (sparse) sum of these respective features, which are an animal label and a set of a few attributes, and all of these are (mostly) mutually orthogonal.

The animal features ...

1chanind22d

Is this just saying that there's superposition noise, so everything is spanning everything else? If so that doesn't seem like it should conflict with being able to use a dictionary, dictionary learning should work with superposition noise as long as the interference doesn't get too massive. If you mean that the attributes are a basis in the sense that the neurons are a basis, then I don't see how you can say there's a unique "label" direction for each animal that's separate from the underlying attributes such that you can set any arbitrary combination of attributes, including all attributes turned on at once or all turned off since they're not sparse, and still read off the animal label without interference. It seems like that would be like saying that the elephant direction = [1, 0, -1], but you can change arbitrarily all 3 of those numbers to any other numbers and still be the elephant direction.

Lucius Bushnaq's Shortform

Lucius Bushnaq22d20

'elephant' would be a sum of fifty attribute feature vectors, all with scalar coefficients that match elephants in particular. The coefficients would tend to have sizes on the order of $\frac{1}{\sqrt{50}}$, because the subspace is fifty-dimensional. So, if you wanted to have a pure tiny feature and an elephant feature active at the same time to encode a tiny elephant, 'elephant' and 'tiny' would be expected to have read-off interference on the order of $\frac{1}{\sqrt{50}}$. Alternatively, you could instead encode a new animal 'tiny elephant' as its own point in the fifty-dimension...
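The $\frac{1}{\sqrt{50}}$ scaling above can be checked numerically. This is only a rough sketch under assumptions that are not in the comment: the fifty attribute directions are taken to be the standard basis, 'tiny' is arbitrarily attribute 0, and each animal is modelled as a random unit vector in the attribute subspace. The names `random_unit_vector`, `animals` and `interference` are made up for the illustration.

```python
import math
import random

DIM = 50  # dimension of the attribute subspace


def random_unit_vector(dim, rng):
    """Sample a direction uniformly on the unit sphere (assumed animal embedding)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


rng = random.Random(0)

# Assumption: attribute features are the standard basis; 'tiny' is attribute 0.
tiny = [1.0 if i == 0 else 0.0 for i in range(DIM)]

# Assumption: 1000 animals, each a random unit vector spanned by the attributes.
animals = [random_unit_vector(DIM, rng) for _ in range(1000)]

# Read-off interference: how strongly a linear 'tiny' read-off fires on a pure animal.
interference = [abs(dot(animal, tiny)) for animal in animals]
rms_interference = math.sqrt(sum(x * x for x in interference) / len(interference))

print("RMS interference:", rms_interference)
print("1 / sqrt(50):    ", 1.0 / math.sqrt(DIM))
```

The two printed numbers should come out close to each other, since the squared coefficients of a unit vector in fifty dimensions have to sum to one.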

3chanind21d

This implies that there is no elephant direction separate from the attributes that happen to commonly co-occur with elephants. E.g. it's not possible to represent an elephant with any arbitrary combination of attributes, as the attributes themselves are what defines the elephant direction. This is what I mean that the attributes are the 'base units' in this scheme, and 'animals' are just commonly co-occurring sets of attributes. This is the same as the "red triangle" problem in SAEs: https://www.lesswrong.com/posts/QoR8noAB3Mp2KBA4B/do-sparse-autoencoders-find-true-features. The animals in this framing are just invented combinations of the underlying attribute features. We would want the dictionary to learn the attributes, not arbitrary combinations of attributes, since these are the true "base units" that can vary freely. e.g. in the "red triangle" problem, we want a dictionary to learn "red" and "triangle", not "red triangle" as its own direction. Put another way, there's no way to represent an "elephant" in this scheme without also attaching attributes to it. Likewise, it's not possible to differentiate between an elephant with the set of attributes x y and z and a rabbit with identical attributes x y and z, since the sum of attributes are what you're calling an elephant or rabbit. There's no separate "this is a rabbit, regardless of what attributes it has" direction. To properly represent animals and attributes, there needs to be a direction for each animal that's separate from any attributes that animal may have, so that it's possible to represent a "tiny furry pink elephant with no trunk" vs a "tiny furry pink rabbit with no trunk".

4J Bostock22d

Is the distinction between "elephant + tiny" and "exampledon" primarily about the things the model does downstream? E.g. if none of the fifty dimensions of our subspace represent "has a bright purple spleen" but exampledons do, then the model might need to instead produce a "purple" vector as an output from an MLP whenever "exampledon" and "spleen" are present together.

Lucius Bushnaq's Shortform

Lucius Bushnaq22d62

E.g. the concept of a "furry elephant" or a "tiny elephant" would be unrepresentable in this scheme

It's representable. E.g. the model can learn a circuit reading in a direction that is equal to the sum of the furry attribute direction and the elephant direction, or the tiny direction and the elephant direction respectively. This circuit can then store facts about furry elephants or tiny elephants.

I feel like in this scheme, it's not really the case that there's 1000 animal directions, since the base unit is the attributes

In what sense? If you represent the...

1chanind22d

If I understand correctly, it sounds like you're saying there is a "label" direction for each animal that's separate from each of the attributes. So, you could have activation a1 = elephant + small + furry + pink, and a2 = rabbit + small + furry + pink. a1 and a2 have the same attributes, but different animal labels. Their corresponding activations are thus different despite having the same attributes due to the different animal label components. I'm confused why a dictionary that consists of a feature direction for each attribute and each animal label can't explain these activations? These activations are just a (sparse) sum of these respective features, which are an animal label and a set of a few attributes, and all of these are (mostly) mutually orthogonal. In this sense the activations are just the sum of the various elements of the dictionary multiplied by a magnitude, so it seems like you should be able to explain these activations using dictionary learning. Is the idea that the 1000 animals and 50 attributes form an overcomplete basis, therefore you can come up with infinite ways to span the space using these basis components? The idea behind compressed sensing in dictionary learning is that if each activation is composed of a sparse sum of features, then L1 regularization can still recover the true features despite the basis being overcomplete.
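To make the compressed-sensing point concrete, here is a toy sketch of L1-regularised sparse coding with a fixed, known overcomplete dictionary. It uses plain ISTA (iterative soft-thresholding), which is a swapped-in textbook method and not anything the commenters specify. It also only recovers coefficients for a given dictionary; it does not learn the dictionary. The three-element dictionary, the penalty and the step size are all made-up illustration values.

```python
import math


def soft_threshold(value, threshold):
    """Proximal operator of the L1 norm: shrink towards zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def sparse_code(dictionary, activation, l1_penalty=0.1, step=0.4, iterations=2000):
    """ISTA for min_c 0.5 * ||a - sum_i c_i d_i||^2 + l1_penalty * ||c||_1."""
    dim = len(activation)
    coeffs = [0.0] * len(dictionary)
    for _ in range(iterations):
        recon = [sum(c * d[k] for c, d in zip(coeffs, dictionary)) for k in range(dim)]
        residual = [activation[k] - recon[k] for k in range(dim)]
        coeffs = [
            soft_threshold(
                c + step * sum(d[k] * residual[k] for k in range(dim)),
                step * l1_penalty,
            )
            for c, d in zip(coeffs, dictionary)
        ]
    return coeffs


# Hypothetical overcomplete dictionary: three unit vectors in two dimensions.
s = 1.0 / math.sqrt(2.0)
dictionary = [[1.0, 0.0], [0.0, 1.0], [s, s]]

# The activation is a sparse sum: 2.0 times the first dictionary element only.
activation = [2.0, 0.0]

coeffs = sparse_code(dictionary, activation)
print(coeffs)
```

With these values the solver should put nearly all of the weight on the first element and roughly zero on the other two, slightly shrunk by the L1 penalty. Whether this kind of recovery carries over to the animal/attribute setup is exactly what the thread disputes.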

4J Bostock22d

Just to clarify, do you mean something like "elephant = grey + big + trunk + ears + African + mammal + wise" so to encode a tiny elephant you would have "grey + tiny + trunk + ears + African + mammal + wise" which the model could still read off as 0.86 × elephant when relevant, but also tiny when relevant?
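The 0.86 figure is consistent with a simple reading of the question: if the seven attributes are orthonormal directions and "elephant" is their unweighted sum, then swapping "big" for "tiny" keeps six of seven components, and 6/7 ≈ 0.86. A small sketch under exactly those assumptions (orthonormal one-hot attributes, equal weights), neither of which is confirmed by the thread:

```python
import math

# Assumption: each attribute is its own orthonormal direction (one-hot here).
ATTRIBUTES = ["grey", "big", "trunk", "ears", "African", "mammal", "wise", "tiny"]


def one_hot(name):
    return [1.0 if attribute == name else 0.0 for attribute in ATTRIBUTES]


def add(*vectors):
    return [sum(parts) for parts in zip(*vectors)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(v):
    return math.sqrt(dot(v, v))


elephant = add(*[one_hot(a) for a in
                 ["grey", "big", "trunk", "ears", "African", "mammal", "wise"]])
tiny_elephant = add(*[one_hot(a) for a in
                      ["grey", "tiny", "trunk", "ears", "African", "mammal", "wise"]])

# Project onto the unit elephant direction, measured in units of |elephant|.
elephant_direction = [x / norm(elephant) for x in elephant]
elephant_readoff = dot(tiny_elephant, elephant_direction) / norm(elephant)

# The 'tiny' read-off is a projection onto the tiny attribute direction.
tiny_readoff = dot(tiny_elephant, one_hot("tiny"))

print(round(elephant_readoff, 2), tiny_readoff)  # 0.86 1.0
```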

Eli's shortform feed

Lucius Bushnaq23d20

you mean does not necessarily produce an agent that cares about x? (at any given relevant level of capability)

Yes.

Eli's shortform feed

Lucius Bushnaq23d40

I don't think I am very good at explaining my thoughts on this in text. Some prior writings that have informed my models here are the MIRI dialogues, and the beginning parts of Steven Byrnes' sequence on brain-like AGI, which sketch how the loss functions human minds train on might look and gave me an example apart from evolution to think about.

Some scattered points that may or may not be of use:

There is something here about path dependence. Late in training at high capability levels, very many things the system might want are compatible with scoring very...

Lucius Bushnaq's Shortform

Lucius Bushnaq23d20

Nope. Try it out. If you attempt to split the activation vector into 1050 vectors for animals + attributes, you can't get the dictionary activations to equal the feature activations $c_i(x)$, $c'_i(x)$.

1Oliver Clive-Griffin21d

Is the central point here that a given input will activate its representation in both the size 1000 and size 50 sub-dictionaries, meaning the reconstruction will be 2x too big?
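The "try it out" above can be done on a toy version. The sketch below assumes three orthonormal attribute directions and two animals whose vectors are fully spanned by those attributes, with feature activations read off by linear projection. All names and numbers are invented for the illustration, and it only shows the double-counting reading raised in the question, which the visible thread does not confirm as the full point.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def scale(k, v):
    return [k * x for x in v]


def add(*vectors):
    return [sum(parts) for parts in zip(*vectors)]


# Assumption: attribute features are orthonormal directions.
attributes = {
    "grey": [1.0, 0.0, 0.0],
    "big": [0.0, 1.0, 0.0],
    "furry": [0.0, 0.0, 1.0],
}

# Assumption: animal vectors are unit vectors fully spanned by the attributes.
# The toy rabbit has a negative 'big' coefficient, i.e. it is small.
animals = {
    "elephant": [0.6, 0.8, 0.0],
    "rabbit": [0.8, -0.6, 0.0],
}

# Input: a pure elephant.
activation = animals["elephant"]

# Feature activations by linear read-off, for animals and for attributes.
animal_acts = {name: dot(activation, d) for name, d in animals.items()}
attribute_acts = {name: dot(activation, d) for name, d in attributes.items()}

# Treat all five directions as one dictionary and sum activation * direction.
reconstruction = add(
    *[scale(animal_acts[name], animals[name]) for name in animals],
    *[scale(attribute_acts[name], attributes[name]) for name in attributes],
)

print("activation:    ", activation)
print("reconstruction:", reconstruction)
```

Under these assumptions the reconstruction comes out at about twice the activation: the elephant is counted once through its animal direction and once more through its attributes.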

Insights from a Lawyer turned AI Safety researcher (ShortForm)

Lucius Bushnaq23d30

I did not know about this already.

1Katalina Hernandez23d

I don't think it's been widely discussed within AI Safety forums. Do you have any other comments, though? Epistemic pessimism is welcomed XD. But I did think that this was at least update-worthy.

Eli's shortform feed

Lucius Bushnaq23d*1614

For the same reasons training an agent on a constitution that says to care about $x$ does not, at arbitrary capability levels, produce an agent that cares about $x$.

If you think that doing this does produce an agent that cares about $x$ even at arbitrary capability levels, then I guess in your world model it would indeed be consistent for that to work for inducing corrigibility as well.

3Eli Tyre23d

Ok, but I'm trying to ask why not. Here's the argument that I would make for why not, followed by why I'm skeptical of it right now. New options for the AI will open up at high capability levels that were not available at lower capability levels. This could in principle lead to undefined behavior that deviates from what we intended. More specifically, if it's the case that... 1. The best / easiest-for-SGD-to-find way to compute corrigible outputs (as evaluated by the AI) is to reinforce an internal proxy measure that is correlated with corrigibility (as evaluated by the AI) in distribution, instead of to reinforce circuits that implement corrigibility more-or-less directly. 2. When the AI gains new options unlocked by new advanced capabilities, that proxy measure comes apart from corrigibility (as evaluated by the AI), in the limit of capabilities, so that the proxy measure is almost uncorrelated with corrigibility ...then the resulting system will not end up corrigible. (Is this the argument that you would give, or is there another reason why you expect that "training an agent on a constitution that says to care about x does not, at arbitrary capability levels, produce an agent that cares about x"?) But, at the moment, I'm skeptical of the above line of argument for several reasons. * I'm skeptical of the first premise, that the best way that SGD can find to produce corrigible outputs (as evaluated by the AI) is to reinforce a proxy measure. * I understand that natural selection, when shaping humans for inclusive genetic fitness, instilled in them a bunch of proxy-drives. But I think this analogy is misleading in several ways. * Most relevantly, there's a genetic bottleneck, so evolution could only shape human behavior by selecting over genomes, and genomes don't encode that much knowledge about the world. If humans were born into the world with detailed world models, that included the concept of inclusive genetic fitness baked in, evolution would abso

5Seth Herd23d

Surely you mean does not necessarily produce an agent that cares about x? (at any given relevant level of capability) Having full confidence that we either can or can't train an agent to have a desired goal both seem difficult to justify. I think the point here is that training for corrigibility seems safer than other goals because it makes the agent useful as an ally in keeping it aligned as it grows more capable or designs successors.

Lucius Bushnaq's Shortform

Lucius Bushnaq23d*8535

The features a model thinks in do not need to form a basis or dictionary for its activations.

Three assumptions people in interpretability often make about the features that comprise a model’s ontology:

1. Features are one-dimensional variables.
   * Meaning, the value of feature $i$ on data point $x$ can be represented by some scalar number $c_i(x)$.
2. Features are ‘linearly represented’.
   * Meaning, each feature $c_i(x)$ can be approximately recovered from the activation vector $\vec{a}(x)$[1] with a linear projection onto an associated feature...

4Logan Riggs21d

I think you're saying: So here we have 2 attributes (for $\vec{f}_{\text{att}}$) & 4 animals (for $\vec{f}_{\text{animal}}$). If the downstream circuit (let's assume a linear + ReLU) reads from the "Cute" direction then: 1. If we are only using $\vec{f}_{\text{animal}}$: Bunny + Dolphin (interpretable, but add 100 more animals & it'll take a lot more work to interpret) 2. If we are only using $\vec{f}_{\text{att}}$: Cute (simple) If a downstream circuit reads from the "bunny" direction, then the reverse: 1. Only $\vec{f}_{\text{animal}}$: Bunny (simple) 2. Only $\vec{f}_{\text{att}}$: Cute + Furry ( + 48 attributes makes it more complex) However, what if there's a particularly cute rabbit? 1. Only $\vec{f}_{\text{animal}}$: Bunny + 0.2*Dolphin(?) (+ many more animals) 2. Only $\vec{f}_{\text{att}}$: 2*Cute + Furry (+ many more attributes) Neither of the above work! BUT what if we mixed them: 3. Bunny + 0.2*Cute (simple) I believe you're claiming that something like APD would, when given the very cute rabbit input, activate the Bunny & Cute components (or whatever directions the model is actually using), which can be in different bases, so can't form a dictionary/basis. [1] Technically you didn't specify that $c(x)$ can't be an arbitrary function, so you'd be able to reconstruct activations combining different bases, but it'd be horribly convoluted in practice.

4chanind22d

It seems like in this setting, the animals are just the sum of attributes that commonly co-occur together, rather than having a unique identifying direction. E.g. the concept of a "furry elephant" or a "tiny elephant" would be unrepresentable in this scheme, since elephant is defined as just the collection of attributes that elephants usually have, which includes being large and not furry. I feel like in this scheme, it's not really the case that there's 1000 animal directions, since the base unit is the attributes, and there's no way to express an animal separately from its attributes. For there to be a true "elephant" direction, then it should be possible to have any set of arbitrary attributes attached to an elephant (small, furry, pink, etc...), and this would require that there is a "label" direction that indicates "elephant" that's mostly orthogonal to every other feature so it can be queried uniquely via projection. That being said, I could imagine a situation where the co-occurrence between labels and attributes is so strong (nearly perfect hierarchy) that the model's circuits can select the attributes along with the label without it ever being a problem during training. For instance, maybe a circuit that's trying to select the "elephant" label actually selects "elephant + gray", and since "pink elephant" never came up during training, the circuit never received a gradient to force it to just select "elephant" which is what it's really aiming for.

2Joseph Miller23d

If the animal specific features form an overcomplete basis, isn't the set of animals + attributes just an even more overcomplete basis?

Alexander Gietelink Oldenziel's Shortform

Lucius Bushnaq1mo*121

I think the value proposition of AI 2027-style work lies largely in communication. Concreteness helps people understand things better. The details are mostly there to provide that concreteness, not to actually be correct.

If you imagine the set of possible futures that people like Daniel, you or I think plausible as big distributions, with high entropy and lots of unknown latent variables, the point is that the best way to start explaining those distributions to people outside the community is to draw a sample from them and write it up. This is a lot of wor...

4Cole Wyeth1mo

I have the same experience, whenever I try to explain AI X-risk to a “layman” they want a concrete story about how AGI could take over.

Changing my mind about Christiano's malign prior argument

Lucius Bushnaq1mo20

The bound is the same one you get for normal Solomonoff induction, except restricted to the set of programs the cut-off induction runs over. It’s a bound on the total expected error in terms of CE loss that the predictor will ever make, summed over all datapoints.
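For reference, the textbook form of this kind of bound can be written out as follows. This is the standard Bayesian-mixture argument, stated under assumptions that go beyond the comment: $\mathcal{P}$ denotes the set of programs the cut-off induction runs over, $w_p$ the prior weight of program $p$, and $\mu$ the true data distribution, assumed to be in $\mathcal{P}$. Whether it matches the bound in the linked post term by term is not something this thread confirms.

```latex
% Mixture predictor over the restricted program class (notation assumed):
M(x_{1:n}) \;=\; \sum_{p \in \mathcal{P}} w_p \, p(x_{1:n}),
\qquad w_p > 0, \quad \sum_{p \in \mathcal{P}} w_p \le 1 .

% Dropping all but one term of the sum gives, for every p and every sequence,
-\log M(x_{1:n}) \;\le\; -\log p(x_{1:n}) \;+\; \log \frac{1}{w_p} .

% Taking p = \mu and expectations under \mu, the total expected excess
% cross-entropy loss, summed over all datapoints, is bounded by a constant:
\sum_{t=1}^{\infty} \mathbb{E}_{\mu}\!\left[
  D_{\mathrm{KL}}\!\big( \mu(\cdot \mid x_{<t}) \,\big\|\, M(\cdot \mid x_{<t}) \big)
\right] \;\le\; \ln \frac{1}{w_\mu} .
```

The bound is on the sum over all datapoints, so it constrains how much total error the predictor can accumulate but does not by itself say when the errors occur.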

Look at the bound for cut-off induction in that post I linked, maybe? Hutter might also have something on it.

Can also discuss on a call if you like.

Note that this doesn’t work in real life, where the programs are not in fact restricted to outputting bit string predictions and can e.g. try to trick the hardware they’re running on.

2Jeremy Gillen1mo

Yeah I know that bound, I've seen a very similar one. The problem is that mesa-optimisers also get very good prediction error when averaged over all predictions. So they exist well below the bound. And they can time their deliberately-incorrect predictions carefully, if they want to survive for a long time.

Changing my mind about Christiano's malign prior argument

Lucius Bushnaq1mo20

You also want one that generalises well, and doesn't do performative predictions, and doesn't have goals of its own. If your hypotheses aren't even intended to be reflections of reality, how do we know these properties hold?

Because we have the prediction error bounds.

When we compare theories, we don't consider the complexity of all the associated approximations and abstractions. We just consider the complexity of the theory itself.
E.g. the theory of evolution isn't quite code for a costly simulation. But it can be viewed as a set of statements about su...

2Jeremy Gillen1mo

To respond to your edit: I don't see your reasoning, and that isn't my intuition. For moderately complex worlds, it's easy for the description length of the world to be longer than the description length of many kinds of inductor. Not ones that can rule out any of those things. My understanding is that the bounds are asymptotic or average-case in a way that makes them useless for this purpose. So if a mesa-inductor is found first that has a better prior, it'll stick with the mesa-inductor. And if it has goals, it can wait as long as it wants to make a false prediction that helps achieve its goals. (Or just make false predictions about counterfactuals that are unlikely to be chosen). If I'm wrong then I'd be extremely interested in seeing your reasoning. I'd maybe pay $400 for a post explaining the reasoning behind why prediction error bounds rule out mesa-optimisers in the prior.

Changing my mind about Christiano's malign prior argument

Lucius Bushnaq1mo51

That’s fine. I just want a computable predictor that works well. This one does.

Also, scientific hypotheses in practice aren’t actually simple code for a costly simulation we run. We use approximations and abstractions to make things cheap. Most of our science outside particle physics is about finding more effective approximations for stuff.

Edit: Actually, I don’t think this would yield you a different general predictor as the program dominating the posterior. General inductor program $P_1$ running program $P_2$ is pretty much never going to b...

2Jeremy Gillen1mo

You also want one that generalises well, and doesn't do performative predictions, and doesn't have goals of its own. If your hypotheses aren't even intended to be reflections of reality, how do we know these properties hold? When we compare theories, we don't consider the complexity of all the associated approximations and abstractions. We just consider the complexity of the theory itself. E.g. the theory of evolution isn't quite code for a costly simulation. But it can be viewed as a set of statements about such a simulation. And the way we compare the theory of evolution to alternatives doesn't involve comparing the complexity of the set of approximations we used to work out the consequences of each theory. Edit to respond to your edit: I don't see your reasoning, and that isn't my intuition. For moderately complex worlds, it's easy for the description length of the world to be longer than the description length of many kinds of inductor.

Changing my mind about Christiano's malign prior argument

Lucius Bushnaq1mo20

If you make an agent by sticking together cut-off Solomonoff induction and e.g. causal decision theory, I do indeed buy that this agent will have problems. Because causal decision theory has problems.

2Cole Wyeth1mo

But how serious will these problems be? What if you encrypt the agent's thoughts, add pain sensors, and make a few other simple patches to deal with embeddedness? I wouldn't be comfortable handing the lightcone over to such a thing, but I don't really expect it to fall over anytime soon.

Changing my mind about Christiano's malign prior argument

Lucius Bushnaq1mo*94

Thank you for this summary.

I still find myself unconvinced by all the arguments against the Solomonoff prior I have encountered. For this particular argument, as you say, there are still many ways the conjectured counterexample of adversaria could fail if you actually tried to sit down and formalise it. Since the counterexample is designed to break a formalism that looks and feels really natural and robust to me, my guess is that the formalisation will indeed fall to one of these obstacles, or a different one.

In a way, that makes perfect sense; Solomon

... (read more)

2Cole Wyeth1mo

What you describe is not actually equivalent to AIXI-tl, which conducts a proof search to justify policies. Your idea has more in common with Schmidhuber’s speed prior.

2Jeremy Gillen1mo

One thing to keep in mind is that time cut-offs will usually rule out our own universe as a hypothesis. Our universe is insanely compute inefficient. So the "hypotheses" inside your inductor won't actually end up corresponding to what we mean by a scientific hypothesis. The only reason this inductor will work at all is that it's done a brute force search over a huge space of programs until it finds one that works. Plausibly it'll just find a better efficient induction algorithm, with a sane prior.
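A toy sketch of the point about time cut-offs, not an implementation of Solomonoff induction, the speed prior, or AIXI-tl. It assumes each hypothesis gets prior weight 2^-length and is dropped before the Bayesian update if its runtime exceeds a step budget. The `Hypothesis` fields, the step costs, and the three example hypotheses are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Hypothesis:
    name: str
    length_bits: int      # description length; prior weight is 2 ** -length_bits
    steps_per_bit: int    # hypothetical runtime cost of predicting one bit
    predict: Callable[[List[int]], float]  # P(next bit = 1 | history)


def alternating(confidence: float) -> Callable[[List[int]], float]:
    """Predictor that expects the sequence 0, 1, 0, 1, ... with the given confidence."""
    def predict(history: List[int]) -> float:
        expected = 0 if not history else 1 - history[-1]
        return confidence if expected == 1 else 1.0 - confidence
    return predict


def posterior(hyps: List[Hypothesis], data: List[int], step_budget: int) -> Dict[str, float]:
    """Normalised posterior over the hypotheses that fit inside the time cut-off."""
    weights = {}
    for h in hyps:
        if h.steps_per_bit * len(data) > step_budget:
            continue  # too slow: this hypothesis is cut off entirely
        w = 2.0 ** -h.length_bits
        for i, bit in enumerate(data):
            p_one = h.predict(data[:i])
            w *= p_one if bit == 1 else 1.0 - p_one
        weights[h.name] = w
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()} if total > 0 else {}


hypotheses = [
    # Short but very expensive: stands in for "simulate the true physics".
    Hypothesis("exact_simulation", 10, 10**6, alternating(0.99)),
    # Longer but cheap: stands in for an efficient approximation.
    Hypothesis("cheap_heuristic", 20, 10, alternating(0.9)),
    Hypothesis("fair_coin", 5, 1, lambda history: 0.5),
]
data = [0, 1] * 10

unbounded = posterior(hypotheses, data, step_budget=10**9)
bounded = posterior(hypotheses, data, step_budget=1000)
print(max(unbounded, key=unbounded.get))
print(max(bounded, key=bounded.get))
```

With the generous budget the short, expensive hypothesis dominates; with the tight budget it is excluded and the posterior shifts to the cheap approximation, which is the behaviour the comment describes.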

4Garrett Baker1mo

I think I mostly agree with this, though I think things possibly get more complicated when you throw decision theory into the mix. I think it unlikely I'm being adversarially simulated in part. I could believe that such malign prior problems are actually decision theory problems much more than epistemic problems. E.g. "no, I am not going to do what the evil super-simple-simulators want me to do because they will try to invade my prior iff (I would act like they have invaded my prior iff they invade my prior)".

johnswentworth's Shortform

Lucius Bushnaq1mo*190

A quick google search says the male is primary or exclusive breadwinner in a majority of married couples. Ass-pull number: the monetary costs alone are probably ~50% higher living costs. (Not a factor of two higher, because the living costs of two people living together are much less than double the living costs of one person. Also I'm generally considering the no-kids case here; I don't feel as confused about couples with kids.)

But remember that you already conditioned on 'married couples without kids'. My guess would be that in the subset of man-woman mar... (read more)

6johnswentworth1mo

This comment gave me the information I'm looking for, so I don't want to keep dragging people through it. Please don't feel obligated to reply further! That said, I did quickly look up some data on this bit: ... so I figured I'd drop it in the thread. When interpreting these numbers, bear in mind that many couples with no kids probably intend to have kids in the not-too-distant future, so the discrepancy shown between "no children" and 1+ children is probably somewhat smaller than the underlying discrepancy of interest (which pushes marginally more in favor of Lucius' guess).

2johnswentworth1mo

Big thank you for responding, this was very helpful.

Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)

Lucius Bushnaq1mo20

Sure, I agree that, as we point out in the post

Yes, sorry I missed that. The section is titled 'Conclusions' and comes at the end of the post, so I guess I must have skipped over it because I thought it was the post conclusion section rather than the high-frequency latents conclusion section.

As long as your evaluation metrics measure the thing you actually care about...

I agree with this. I just don't think those autointerp metrics robustly capture what we care about.

Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)

Lucius Bushnaq1mo*10

Removing High Frequency Latents from JumpReLU SAEs

On a first read, this doesn't seem principled to me? How do we know those high-frequency latents aren't, for example, basis directions for dense subspaces or common multi-dimensional features? In that case, we'd expect them to activate frequently and maybe appear pretty uninterpretable at a glance. Modifying the sparsity penalty to split them into lower frequency latents could then be pathological, moving us further away from capturing the features of the model even though interpretability scores might impr... (read more)

1Senthooran Rajamanoharan1mo

Sure, I agree that, as we point out in the post, this penalty may not be targeting the right thing, or could be targeting it in the wrong way. We shared this more as a proof of concept that others may like to build on and don't claim it's a superior solution to standard JumpReLU training. A minor quibble on the ad-hoc point: while I completely agree about the pitfalls of ad-hoc definitions, I don't think the same arguments apply about ad-hoc training procedures. As long as your evaluation metrics measure the thing you actually care about, ML has a long history of ad-hoc approaches to optimising those metrics performing surprisingly well. Having said that though, I agree it would be great to see more research into what's really going on with these dense features, and this leading into a more principled approach to dealing with them! (Whether that turns out to be better understanding how to interpret them or improving SAE training to fix them.)

4Neel Nanda1mo

I think it's pretty plausible that something pathological like that is happening. We're releasing this as an interesting idea that others might find useful for their use case, not as something we're confident is a superior method. If we were continuing with SAE work, we would likely sanity check it more but we thought it better to release it than not

Computational Superposition in a Toy Model of the U-AND Problem

Lucius Bushnaq1mo30

Forgot to tell you this when you showed me the draft: The comp in sup paper actually had a dense construction for UAND included already. It works differently than the one you seem to have found though, using Gaussian weights rather than binary weights.

1Adam Newgas1mo

Yes, I don't think the exact distribution of weights Gaussian/uniform/binary really makes that much difference, you can see the difference in loss in some of the charts above. The extra efficiency probably comes from the fact that every neuron contributes to everything fully - with Gaussian, sometimes the weights will be close to zero. Some other advantages:
* But they are somewhat easier to analyse than Gaussian weights.
* They can be skewed (p≠0.5) which seems advantageous for an unknown reason. Possibly it makes AND circuits better at the expense of other possible truth tables.

Eukaryote Skips Town - Why I'm leaving DC

Lucius Bushnaq1mo82

I will continue to do what I love, which includes reading and writing and thinking about biosecurity and diseases and animals and the end of the world and all that, and I will scrape out my existence one way or another.

Thank you. As far as I'm aware we don't know each other at all, but I really appreciate you working to do good.

Elite Coordination via the Consensus of Power

Lucius Bushnaq2mo70

I don't think the risks of talking about the culture war have gone down. If anything, it feels like it's yet again gotten worse. What exactly is risky to talk about has changed a bit, but that's it. I'm more reluctant than ever to involve myself in culture war adjacent discussions.

Why Are The Human Sciences Hard? Two New Hypotheses

Lucius Bushnaq2mo*132

This comment by Carl Feynman has a very crisp formulation of the main problem as I see it.

They’re measuring a noisy phenomenon, yes, but that’s only half the problem. The other half of the problem is that society demands answers. New psychology results are a matter of considerable public interest and you can become rich and famous from them. In the gap between the difficulty of supply and the massive demand grows a culture of fakery. The same is true of nutrition— everyone wants to know what the healthy thing to eat is, and the fact

... (read more)

Garrett Baker2mo106

So, the recipe for making a broken science you can't trust is

The public cares a lot about answers to questions that fall within the science's domain.

The science currently has no good attack angles on those questions.

To return to LessWrong's favorite topic, this doesn't bode well for alignment.

johnswentworth's Shortform

Lucius Bushnaq2mo*3027

Relationship ... stuff?

I guess I feel kind of confused by the framing of the question. I don't have a model under which the sexual aspect of a long-term relationship typically makes up the bulk of its value to the participants. So, if a long-term relationship isn't doing well on that front, and yet both participants keep pursuing the relationship, my first guess would be that it's due to the value of everything that is not that. I wouldn't particularly expect any one thing to stick out here. Maybe they have a thing where they cuddle and watch the sun... (read more)

johnswentworth's Shortform

Lucius Bushnaq2mo75

This data seems to be for sexual satisfaction rather than romantic satisfaction or general relationship satisfaction.

3johnswentworth2mo

Yes, the question is what value-proposition accounts for the romantic or general relationship satisfaction.

Vacuum Decay: Expert Survey Results

Lucius Bushnaq2mo20

How sub-light? I was mostly just guessing here, but if it’s below like 0.95c I’d be surprised.

2DaemonicSigil2mo

I could be wrong, but from what I've read the domain wall should have mass, so it must travel below light speed. However, the energy difference between the two vacuums would put a large force on the wall, rapidly accelerating it to very close to light speed. Collisions with stars and gravitational effects might cause further weirdness, but ignoring that, I think after a while we basically expect constant acceleration, meaning that light cones starting inside the bubble that are at least a certain distance from the wall would never catch up with the wall. So yeah, definitely above 0.95c.
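A worked version of the "never catch up" claim, under the idealisation that the wall has constant proper acceleration $a$ and starts from rest at $x = 0$ at $t = 0$ (the symbols $a$, $x_0$, $x_w$ and $x_\gamma$ are introduced here for illustration). The wall's position is

$$x_w(t) = \frac{c^2}{a}\left(\sqrt{1 + \left(\frac{a t}{c}\right)^2} - 1\right),$$

and a light signal emitted at $t = 0$ from a point $x_0 < 0$ behind the wall follows $x_\gamma(t) = x_0 + c t$. The gap between them is

$$x_w(t) - x_\gamma(t) = \frac{c^2}{a}\left(\sqrt{1 + \left(\frac{a t}{c}\right)^2} - 1\right) - c t - x_0 \;\xrightarrow{\,t \to \infty\,}\; -\frac{c^2}{a} - x_0 .$$

Since $x_w(t) - c t$ decreases monotonically from $0$ towards $-c^2/a$ without reaching it, the gap stays positive for all $t$ exactly when $|x_0| \geq c^2/a$. So in this idealised picture, signals starting at least a distance $c^2/a$ behind the wall never reach it, while the wall's speed approaches $c$.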

1JessRiedel2mo

I'd also be surprised.

Vacuum Decay: Expert Survey Results

Lucius Bushnaq2mo20

It expands at light speed. That's fast enough that no computational processing can possibly occur before we're dead. Sure there's branches where it maims us and then stops, but these are incredibly subdominant compared to branches where the tunneling doesn't happen.

Yes, you can make suicide machines very reliable and fast. I claim that whether your proposed suicide machine actually is reliable does in fact matter for determining whether you are likely to find yourself maimed. Making suicide machines that are synchronised earth-wide seems very difficult with current technology.

1JessRiedel2mo

No, vacuum decay generally expands at sub-light speed.

Joseph Miller's Shortform

Lucius Bushnaq2mo*147

This. The struggle is real. My brain has started treating publishing a LessWrong post almost the way it'd treat publishing a paper. An acquaintance got upset at me once because they thought I hadn't provided sufficient discussion of their related LessWrong post in mine. Shortforms are the place I still feel safe just writing things.

It makes sense to me that this happened. AI Safety doesn't have a journal, and training programs heavily encourage people to post their output on LessWrong. So part of it is slowly becoming a journal, and the felt social norms around posts are morphing to reflect that.

4Garrett Baker2mo

In some ways the equilibrium here is worse, journals have page limits.

Vacuum Decay: Expert Survey Results

Lucius Bushnaq2mo*20

I don't think anything in the linked passage conflicts with my model of anticipated experience. My claim is not that the branch where everyone dies doesn't exist. Of course it exists. It just isn't very relevant for our future observations.

To briefly factor out the quantum physics here, because they don't actually matter much:

If someone tells me that they will create a copy of me while I'm anesthetized and unconscious, and put one of me in a room with red walls, and another of me in a room with blue walls, my anticipated experience is that I will wake up t... (read more)

Vacuum Decay: Expert Survey Results

Lucius Bushnaq2mo95

There may be a sense in which amplitude is a finite resource. Decay your branch enough, and your future anticipated experience might come to be dominated by some alien with higher amplitude simulating you, or even just by your inner product with quantum noise in a more mainline branch of the wave function. At that point, you lose pretty much all ability to control your future anticipated experience. Which seems very bad. This is a barrier I ran into when thinking about ways to use quantum immortality to cheat heat death.

2mako yass2mo

The assumption that being totally dead/being aerosolised/being decayed vacuum can't be a future experience is unprovable. Panpsychism should be our null hypothesis[1], and there never has been and never can be any direct measurement of consciousness that could take us away from the null hypothesis. Which is to say, I believe it's possible to be dead.

1. ^ The negation, that there's something special about humans that makes them eligible to experience, is clearly held up by a conflation of having experiences and reporting experiences and the fact that humans are the only things that report anything.

Vacuum Decay: Expert Survey Results

Lucius Bushnaq2mo141

I don't think so. You only need one alien civilisation in our light cone to have preferences about the shape of the universal wave function rather than their own subjective experience for our light cone to get eaten. E.g. a paperclip maximiser might want to do this.

Also, the fermi paradox isn't really a thing.

Vacuum Decay: Expert Survey Results

Lucius Bushnaq2mo141

No, because getting shot has a lot of outcomes that do not kill you but do cripple you. Vacuum decay should tend to have extremely few of those. It’s also instant, alleviating any lingering concerns about identity one might have in a setup where death is slow and gradual. It’s also synchronised to split off everyone hit by it into the same branch, whereas, say, a very high-yield bomb wired to a random number generator that uses atmospheric noise would split you off into a branch away from your friends.^[1]

I’m not unconcerned about vacuum decay, mind you. It... (read more)

1JessRiedel2mo

Vacuum decay is fast but not instant, and there will almost certainly be branches where it maims you and then reverses. Likewise, you can make suicide machines very reliable and fast. It's unreasonable to think any of these mechanical details matter.

4Garrett Baker2mo

Such thought experiments also serve as a solution of sorts to the fermi paradox, and as a rationalization of the sci-fi trope of sufficiently advanced civilizations “ascending”.

Vacuum Decay: Expert Survey Results

Lucius Bushnaq2mo*8-18

Since I didn't see it brought up on a skim: One reason me and some of my physicist friends aren't that concerned about vacuum decay is many-worlds. Since the decay is triggered by quantum tunneling and propagates at light speed, it'd be wiping out earth in one wavefunction branch that has amplitude roughly equal to the amplitude of the tunneling, while the decay just never happens in the other branches. Since we can't experience being dead, this wouldn't really affect our anticipated future experiences in any way. The vacuum would just never decay from our... (read more)

2TAG19d

You are assuming MW, and assuming a form where consciousness hops around between decoherent branches. The standard argument against quantum immortality applies: we don't experience being very old and having experienced surviving against the odds multiple times. In fact, quantum immortality makes a mockery of the odds: you should have a high subjective probability of being in a low objective probability universe.

2cubefox2mo

That's a mistaken way of thinking about anticipated experience, see here:

2cubefox2mo

That's like dying in your sleep. Presumably you strongly don't want it to happen, no matter your opinion on parallel worlds. Then dying in your sleep is bad because you don't want it to happen. For the same reason vacuum decay is bad.

7mako yass2mo

It seems like you're assuming a value system where the ratio of positive to negative experience matters but where the ratio of positive to null (dead timelines) experiences doesn't matter. I don't think that's the right way to salvage the human utility function, personally.

5Jess Riedel2mo

The random fluctuations in macroscopic chaotic systems, like Plinko or a well-flipped coin in air, can be just as fundamentally quantum as vacuum decay through tunneling. So by this argument you’d be unconcerned getting into a machine that flips a coin and shoots you if tails. Bad idea.

Elon Musk May Be Transitioning to Bipolar Type I

Lucius Bushnaq2mo41

I disagreed with Gwern at first. I'm increasingly forced to admit there's something like bipolar going on here

What changed your mind? I don't know any details about the diagnostic criteria for bipolar besides those you and Gwern brought up in that debate. But looking at the points you made back then, it's unclear to me which of them you'd consider to be refuted or weakened now.

Some excerpts:

Musk’s ordinary behavior - intense, risk-seeking, hard-working, grandiose, emotional - does resemble symptoms of hypomania (full mania would usually involve psychosis,

... (read more)

2chaosmage2mo

Interesting. Borderline or PTSD rather than cyclothymia? I don't disagree that's where a standard clinical interview would end up, but aren't these basically residual categories for people who aren't sane but don't clearly fit any of the other boxes? Like, not false, but it doesn't exactly constrain the space of where that weird outlier mind of his might be going next. I'd be very interested in what would happen if he couldn't have his phone for a week.

Proof idea: SLT to AIT

Lucius Bushnaq2mo*20

Based on my understanding of what you are doing, the statement in the OP that $λ$ in your setting is "sort of" K-complexity is a bit misleading?

Yes, I guess it is. In my (weak) defence, I did put a '(sort of)' in front of that.

In my head, the relationship between the learning coefficient and the K-complexity here seems very similar-ish to the relationship between the K-complexities of a hypothesis $h$ expressed on two different UTMs.

If we have a UTM $M_{1}$ and a different UTM $M_{2}$ , we know that $K (h, M_{2}) \leq K (h, M_{1}$ ... (read more)

Attribution-based parameter decomposition

Lucius Bushnaq2mo*22

Kind of? I'd say the big differences are

Experts are pre-wired to have a certain size, components can vary in size from a tiny query-key lookup for a single fact to large modules.
IIRC, MOE networks use a gating function to decide which experts to query. If you ignored this gating and just use all the experts, I think that'd break the model. In contrast, you can use all APD components on a forward pass if you want. Most of them just won't affect the result much.

MOE experts don't completely ignore 'simplicity' as we define it in the paper though. A single expert is simpler than the whole MOE network in that it has lower rank, i.e. fewer numbers are required to describe its state on any given forward pass.
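A toy sketch of the contrast drawn above, with made-up 2x2 matrices. It assumes APD-style parameter components sum to the original weight matrix, and it uses hard top-1 routing as a stand-in for MOE gating (real MOE layers typically use learned, softmax-weighted top-k routing). The names `components`, `experts`, `gate`, `apd_forward` and `moe_forward` are hypothetical and not taken from the paper.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def add(A, B):
    """Elementwise sum of two matrices."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


# APD-style components: together they sum to the full matrix [[1, 2], [3, 4]].
components = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 2.0], [3.0, 0.0]],
    [[0.0, 0.0], [0.0, 4.0]],
]


def apd_forward(x, active):
    """Forward pass using only the parameter components listed in `active`."""
    total = [[0.0, 0.0], [0.0, 0.0]]
    for i in active:
        total = add(total, components[i])
    return matvec(total, x)


# Toy MOE layer: two experts that cancel each other, plus a linear gate.
experts = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[-1.0, 0.0], [0.0, -1.0]],
]
gate = [[1.0, 0.0], [-1.0, 0.0]]  # gate scores are gate @ x


def moe_forward(x, use_gating=True):
    """Top-1 routed output, or the plain sum of all experts if gating is ignored."""
    if use_gating:
        scores = matvec(gate, x)
        best = max(range(len(experts)), key=lambda i: scores[i])
        return matvec(experts[best], x)
    outputs = [matvec(E, x) for E in experts]
    return [sum(vals) for vals in zip(*outputs)]


x = [1.0, 2.0]
print(apd_forward(x, [0, 1, 2]))         # all components: same as the full matrix
print(moe_forward(x))                    # routed to a single expert
print(moe_forward(x, use_gating=False))  # gating ignored: experts cancel
```

In this toy, running every component reproduces the original layer, and on the input `[1.0, 0.0]` the third component contributes nothing, so leaving it out changes nothing. Ignoring the gate in the MOE layer, by contrast, gives a different output from the routed one.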

On the Rationality of Deterring ASI

Lucius Bushnaq2mo72

Why would this be restricted to cyber attacks? If the CCP believed that ASI was possible, even if they didn't believe in the alignment problem, the US developing an ASI would plausibly constitute an existential threat to them. It'd mean they lose the game of geopolitics completely and permanently. I don't think they'd necessarily restrict themselves to covert sabotage in such a situation.

1NickH1mo

The CCP has no reason to believe that the US is even capable of achieving ASI, let alone that it has an advantage over the CCP. No rational actor will go to war over a possibility of a maybe when the numbers could just as likely be in their favour. E.g. if DeepSeek can almost equal OpenAI with fewer resources, it would be rational to allocate more resources to DeepSeek before doing anything as risky as trying to sabotage OpenAI, which is uncertain to succeed and more likely to invite uncontrollable retaliatory escalation.

5Vladimir_Nesov2mo

I'm quibbling with cyberattacks specifically being used as a central example throughout in the document and also on the podcasts. They do mention other kinds of attacks, see How to Maintain a MAIM Regime:

On the Rationality of Deterring ASI

Lucius Bushnaq2mo155

The possibility of stability through dynamics like mutually assured destruction has been where a lot of my remaining hope on the governance side has come from for a while now.

A big selling point of this for me is that it does not strictly require countries to believe that ASI is possible and that the alignment problem is real. Just believing that ASI is possible is enough.