Lucius Bushnaq

AI notkilleveryoneism researcher, focused on interpretability. 

Personal account, opinions are my own. 

I have signed no contracts or agreements whose existence I cannot mention.

Comments

Thank you. Do you know anyone who claims to have observed it?

If terminal lucidity is a real phenomenon, information lost to dementia could still be recoverable in principle. So, cryo-preserving people suffering from dementia for later mind uploading could still work sometimes.

I just heard about terminal lucidity for the first time from Janus:

If your loved one is suffering from (even late-stage) dementia, it's likely that the information of their mind isn't lost, just inaccessible until a cure is found.

Sign them up for cryonics.

This seems pretty important if true. I'd previously thought that if a loved one came down with Alzheimer's, that was likely the end for them in this branch of the world[1], even with cryonics. I'd planned to set up some form of assisted suicide for myself if I was ever diagnosed, to get frozen before my brain got damaged too much.

Skimming the Wikipedia article and the first page of Google results, the documentation we have of terminal lucidity doesn’t seem great. But it tentatively looks to me like it’s probably a real thing at least in some form? Though I guess with the relative rarity of clearly documented cases, it might actually only work for some specific neurological disorders. I find it somewhat hard to imagine how something like this could work with a case of severe Alzheimer's. Doesn't that literally atrophy your brain? 

This is very much not my wheelhouse though. I'd appreciate other people's opinions, especially if they know something about this area of research.

  1. ^

    It seems maybe possible in physical principle to bring back even minds lost to thermodynamic chaos. But that seems like an engineering undertaking so utterly massive I'm not sure even a mature civilisation controlling most of the lightcone could pull it off.

I agree it’s not a valid argument. I’m not sure about ‘dishonest’ though. They could just be genuinely confused about this. I was surprised how many people in machine learning seem to think the universal approximation theorem explains why deep learning works.
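
(Illustrative aside, not part of the original exchange: a minimal sketch of why universality alone is uninformative. A plain lookup-table predictor also fits any finite training set exactly, so "can approximate anything" does nothing to explain why trained deep networks generalise. The toy task and numbers below are made up.)

```python
# Illustrative only: a 1-nearest-neighbour lookup table fits any finite
# training set perfectly, i.e. it is a "universal approximator" of the data,
# yet this tells us nothing about test performance. The same gap is why the
# universal approximation theorem doesn't explain why deep learning works.
import numpy as np

rng = np.random.default_rng(0)

# Made-up toy task: y = sin(3x) plus noise.
x_train = rng.uniform(-1, 1, size=100)
y_train = np.sin(3 * x_train) + 0.1 * rng.normal(size=100)
x_test = rng.uniform(-1, 1, size=1000)
y_test = np.sin(3 * x_test)

def nn_predict(x_query, x_ref, y_ref):
    """Copy the label of the nearest training point."""
    idx = np.argmin(np.abs(x_query[:, None] - x_ref[None, :]), axis=1)
    return y_ref[idx]

train_mse = np.mean((nn_predict(x_train, x_train, y_train) - y_train) ** 2)
test_mse = np.mean((nn_predict(x_test, x_train, y_train) - y_test) ** 2)

# Train MSE is exactly zero; expressivity alone is cheap. Explaining the test
# error (generalisation) is the part the theorem is silent on.
print(f"train MSE: {train_mse:.4f}  test MSE: {test_mse:.4f}")
```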

Anecdotally, the effect of LLMs on my workflow hasn't been very large. 

At a moderate P(doom), say under 25%, from a selfish perspective it makes sense to accelerate AI if it increases the chance that you get to live forever, even if it increases your risk of dying. I have heard from some people that this is their motivation.

If this is you: Please just sign up for cryonics. It's a much better immortality gambit than rushing for ASI.
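
(A toy sketch of the comparison this comment gestures at; every probability below is a made-up placeholder rather than an estimate, and the model structure is my own simplification.)

```python
# Every number here is a placeholder to make the structure of the tradeoff
# explicit; plug in your own estimates. This is not a claim about actual
# probabilities.

p_immortality_given_aligned_asi = 0.5  # chance aligned ASI actually gets you indefinite life

# Strategy A: help accelerate ASI, no cryonics.
p_doom_accel = 0.30            # hypothetical P(doom) if acceleration succeeds
p_asi_in_lifetime_accel = 0.90 # chance ASI arrives within your natural lifetime

p_win_accelerate = (1 - p_doom_accel) * p_asi_in_lifetime_accel * p_immortality_given_aligned_asi

# Strategy B: don't push on timelines, sign up for cryonics instead.
p_doom_baseline = 0.20
p_asi_in_lifetime = 0.70
p_cryo_revival_works = 0.40    # chance preservation + eventual revival works

p_win_cryonics = (1 - p_doom_baseline) * (
    p_asi_in_lifetime + (1 - p_asi_in_lifetime) * p_cryo_revival_works
) * p_immortality_given_aligned_asi

print(f"P(personal immortality | accelerate, no cryonics):   {p_win_accelerate:.3f}")
print(f"P(personal immortality | no acceleration, cryonics): {p_win_cryonics:.3f}")
# With these placeholders the cryonics route comes out ahead without adding
# doom risk for everyone else; the ranking is of course entirely driven by
# the numbers you put in.
```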

I like AE Studio. They seem to genuinely care about AI not killing everyone, and have been willing to actually back original research ideas that don't fit into existing paradigms. 

Side note:

Previous posts have been met with great reception by the likes of Eliezer Yudkowsky and Emmett Shear, so we’re up to something good. 

This might be a joke, but just in case it's not: I don't think you should reason about your own alignment research agenda like this. I think Eliezer would probably be the first person to tell you that.

But they'd be too unchanged: the "afraid of mice" circuit would still be checking for "grey and big and mammal and ..." as the finetune dataset included no facts about animal fears. While some newer circuits formed during fine tuning would be checking for "grey and big and mammal and ... and high-scrabble-scoring". Any interpretability tool that told you that "grey and big and mammal and ..." was "elephant" in the first model is now going to have difficulty representing the situation.

Thank you, this is a good example of a type-of-thing to watch out for in circuit interpretation. I had not thought of this before. I agree that an interpretability tool that rounded those two circuits off to taking in the 'same' feature would be a bad interpretability tool. It should just show you that those two circuits exist, and have some one dimensional features they care about, and those features are related but non-trivially distinct.

But this is not at all unique to the sort of model used in the counterexample. A 'normal' model can still have one embedding direction for elephant, call it v_elephant, at one point, used by a circuit C1, and then switch to a slightly different embedding direction v_elephant′ during fine-tuning. Maybe it learned more features in fine-tuning, some of those features are correlated with elephants and ended up a bit too close in cosine similarity to v_elephant, and so interference can be lowered by moving the embedding around a bit. A circuit C2 learned in fine-tuning would then be reading from this v_elephant′ and not match C1, which is still reading in v_elephant. You might argue that C1 will surely want to adjust to start using v_elephant′ as well to lower the loss, but that would seem to apply equally well to your example. So I don't see how this shows that the model used in the original counterexample has no notion of an elephant in a sense that does not also apply to the sort of models people might tend to imagine when they think in the conventional SDL paradigm.

EDIT: On a second read, I think I misunderstood you here. You seem to think the crucial difference is that the delta between v_elephant and v_elephant′ is mostly 'unstructured', whereas the difference between "grey and big and mammal and ..." and "grey and big and mammal and ... and high-scrabble-scoring" is structured. I don't see why that should matter, though. So long as our hypothetical interpretability tool is precise enough to notice the size of the discrepancy between those features and not throw them into the same pot, we should be fine. For that, it wouldn't seem to me to really matter much whether the discrepancy is 'meaningful' to the model or not. 
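
(A minimal numpy sketch of the scenario above, with illustrative names of my own choosing: circuit C1 reads v_elephant in the base model, fine-tuning shifts the embedding to a nearby v_elephant′, and a new circuit C2 reads the shifted direction.)

```python
# Minimal sketch of the embedding-drift scenario; all names are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d_model = 64

def unit(v):
    return v / np.linalg.norm(v)

# Base model: 'elephant' embedded along v_elephant, read off by circuit C1.
v_elephant = unit(rng.normal(size=d_model))
C1_readoff = v_elephant.copy()

# Fine-tuning nudges the embedding to a nearby direction to reduce
# interference with newly learned, elephant-correlated features.
v_elephant_prime = unit(v_elephant + 0.15 * rng.normal(size=d_model))
C2_readoff = v_elephant_prime.copy()  # circuit learned during fine-tuning

# An activation in the fine-tuned model where 'elephant' is present:
activation = 3.0 * v_elephant_prime

print("cos(v_elephant, v_elephant_prime):", round(float(v_elephant @ v_elephant_prime), 3))
print("C1 read-off:", round(float(C1_readoff @ activation), 3))
print("C2 read-off:", round(float(C2_readoff @ activation), 3))
# The two directions are closely related but distinct; a good interpretability
# tool should surface both one-dimensional features rather than merge them.
```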

 

I'm with @chanind: If elephant is fully represented by a sum of its attributes, then it's quite reasonable to say that the model has no fundamental notion of an elephant in that representation.
...

This is not a load-bearing detail of the example. If you like, you can instead imagine a model that embeds 1000 animals in, e.g., a 100-dimensional subspace, with a 50-dimensional sub-sub-space where the embedding directions correspond to 50 attributes, and a 50-dimensional sub-sub-space where the embeddings are just random. 

This should still give you basically the same issues as the original example, I think? For any dictionary decomposition of the activations you pick, some of the circuits will end up looking like a horrible mess, even though they're secretly taking in a very low-rank subspace of the activations that'd make sense to us if we looked at it. I should probably double check that when I'm more awake though.[1] 
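
(A rough sketch of this modified construction, with placeholder numbers and an arbitrarily chosen read-off direction, assuming a 100-dimensional animal subspace split into a 50-dimensional attribute block and a 50-dimensional random block.)

```python
# Placeholder construction: 1000 animals embedded in a 100-dim subspace,
# 50 attribute dims plus 50 random dims. Attribute names/indices are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
n_animals, n_attrs, d_random = 1000, 50, 50

# Binary attribute vectors ("grey", "big", "mammal", ...).
attributes = rng.integers(0, 2, size=(n_animals, n_attrs)).astype(float)

# Idiosyncratic random directions, one per animal.
random_part = rng.normal(size=(n_animals, d_random)) / np.sqrt(d_random)

# Full animal embeddings: shape (1000, 100).
animal_embeddings = np.concatenate([attributes, random_part], axis=1)

# A downstream circuit that reads a single 'meaningful' direction in the
# attribute block, e.g. "grey + big + mammal" (indices 0, 1, 2 here).
w_circuit = np.zeros(n_attrs + d_random)
w_circuit[[0, 1, 2]] = 1.0

circuit_output = animal_embeddings @ w_circuit
print("first few circuit outputs:", circuit_output[:5])

# The circuit is rank-1 and simple in the attribute basis, but a dictionary
# whose atoms are the 1000 animal directions would have to describe this
# read-off as a large weighted combination of animal features.
```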

I think the central issue here is mostly just having some kind of non-random, 'meaningful' feature embedding geometry that the circuits care about, instead of random feature embeddings. 


 

  1. ^

    EDIT: I am now more awake. I still think this is right.

The kind of 'alignment technique' that successfully points a dumb model in the rough direction of doing the task you want in early training does not necessarily straightforwardly connect to the kind of 'alignment technique' that will keep a model pointed quite precisely in the direction you want after it gets smart and self-reflective.

For a maybe not-so-great example, human RL reward signals in the brain used to successfully train and aim human cognition from infancy to point at reproductive fitness. Before the distributional shift, our brains usually neither got completely stuck in reward-hack loops, nor used their cognitive labour for something completely unrelated to reproductive fitness. After the distributional shift, our brains still don't get stuck in reward-hack loops that much and we successfully train to intelligent adulthood. But the alignment with reproductive fitness is gone, or at least far weaker.

How much money would you guess was lost on this?
