AI notkilleveryoneism researcher, focused on interpretability.
Personal account, opinions are my own.
I have signed no contracts or agreements whose existence I cannot mention.
I suspect language model in-context learning[1] 'approximates Solomonoff induction' in the vague sense that it is a pattern-matching thingy navigating a search space somewhat similar in character to the space of possible computer programs: the space of inputs/parameters for some very universal, Turing-complete-ish computational architecture in which the LM expresses its guesses for patterns, looking for a pattern that matches the data.
The way language models navigate this search space is totally different from SI, which just checks every single point in its search space of UTM programs. But the geometry of the two spaces is similar, with properties like simpler hypotheses corresponding to exponentially more points in the space.
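To spell out the 'exponentially more points' claim (my own sketch in standard Solomonoff-induction notation, with U some fixed prefix UTM, nothing taken from the experiments): feeding U uniformly random bits yields the universal distribution

m(x) \;=\; \sum_{p\,:\,U(p)=x} 2^{-|p|} \;\ge\; 2^{-K_U(x)}.

If the architecture ignores everything after the program proper ends, a hypothesis whose shortest implementation is \ell bits corresponds to roughly 2^{n-\ell} of the 2^n possible length-n inputs, i.e. a 2^{-\ell} fraction of the space. Each extra bit of required description length halves that fraction, which is the sense in which simpler hypotheses own exponentially more points.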
So, even if the language models' in-context learning algorithm were kind of maximally stupid, and literally just guessed random points in the search space until it found a good match to the data, we'd expect its outputs to somewhat match up with the universal distribution, just because both are drawing ≈ uniformly random samples from the space of inputs to a Turing-complete-ish computational architecture.
So, to the extent that these experimental results actually hold up[2], I think the main thing they'd be telling us is that the 'architecture' or 'language' in which the LM expresses its in-context guesses is highly expressive, with a computational universality similar to that of UTMs and many neural network architectures.
Arguably, the latter may be a special case of the former with an appropriate choice of universal Turing machine (UTM), but I find this perspective to be a bit of a stretch. At the very least, I expect LLM ICL to be similar to a universal distribution conditioned on some background information.
What's even the difference between these propositions? Any UTM can be expressed in another UTM as a bit string of prior knowledge to condition on, and I'd intuitively expect the reverse to hold as well, though I don't actually know that for sure.
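The direction I am sure of is just the invariance theorem from Kolmogorov complexity (textbook material, not anything specific to LLMs): for any two UTMs U and V there is a fixed interpreter string i_{U\to V} such that

U(i_{U\to V}\,p) = V(p) \quad \text{for all programs } p,

so K_U(x) \le K_V(x) + |i_{U\to V}| and m_U(x) \ge 2^{-|i_{U\to V}|}\, m_V(x). Switching UTMs therefore behaves like conditioning the first UTM on one fixed finite prefix.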
In other words, will the AGI actually want you to push the button? Or would it want some random weird thing because inner alignment is hard?
My answer is: yes, it would want you to push the button, at least if we’re talking about brain-like AGI, and if you set things up correctly.
Again, getting a brain-like AGI addicted to a reward button is a lot like getting a human or animal hooked on an addictive drug.
Humans addicted to drugs often exhibit weird meta-preferences like 'I want to stop wanting the drug', or 'I want to find an even better kind of drug'.
For this reason, I am not at all confident that a smart thing exposed to the button would later generalise into a coherent, super-smart thing that wants the button to be pressed. Maybe it would perceive the circuits in it that bound to the button reward as foreign to the rest of its goals, and work to remove them. Maybe the button binding would generalise in a strange way.
'Seek to directly inhabit the cognitive state caused by the button press', 'along an axis of cognitive states associated with button presses of various strengths, seek to walk to a far end that does not actually correspond to any kind of button press', 'make the world have a shape related to generalisations of ideas that tended to come up whenever the button was pressed', and just generally 'maximise a utility function made up of algorithmically simple combinations of button-related and pre-button-training-reward-related abstractions' all seem like goals I could imagine a cognitively enhanced human button addict generalising toward. So I am not confident the AGI would generalise to wanting the button to be pushed either, not in the long term.
Thank you. Do you know anyone who claims to have observed it?
If terminal lucidity is a real phenomenon, information lost to dementia could still be recoverable in principle. So, cryo-preserving people suffering from dementia for later mind uploading could still work sometimes.
I just heard about terminal lucidity for the first time from Janus:
If your loved one is suffering from (even late-stage) dementia, it's likely that the information of their mind isn't lost, just inaccessible until a cure is found.
Sign them up for cryonics.
This seems pretty important if true. I'd previously thought that if a loved one came down with Alzheimer's, that was likely the end for them in this branch of the world[1], even with cryonics. I'd planned to set up some form of assisted suicide for myself if I was ever diagnosed, to get frozen before my brain got damaged too much.
Skimming the Wikipedia article and the first page of Google results, the documentation we have of terminal lucidity doesn’t seem great. But it tentatively looks to me like it’s probably a real thing at least in some form? Though I guess with the relative rarity of clearly documented cases, it might actually only work for some specific neurological disorders. I find it somewhat hard to imagine how something like this could work with a case of severe Alzheimer's. Doesn't that literally atrophy your brain?
This is very much not my wheelhouse though. I'd appreciate other people's opinions, especially if they know something about this area of research.
It seems maybe possible in physical principle to bring back even minds lost to thermodynamic chaos. But that seems like an engineering undertaking so utterly massive I'm not sure even a mature civilisation controlling most of the lightcone could pull it off.
I agree it’s not a valid argument. I’m not sure about ‘dishonest’, though. They could just be genuinely confused about this. I was surprised by how many people in machine learning seem to think the universal approximation theorem explains why deep learning works.
Anecdotally, the effect of LLMs on my workflow hasn't been very large.
At a moderate P(doom), say under 25%, from a selfish perspective it makes sense to accelerate AI if it increases the chance that you get to live forever, even if it increases your risk of dying. I have heard from some people that this is their motivation.
If this is you: Please just sign up for cryonics. It's a much better immortality gambit than rushing for ASI.
I like AE Studio. They seem to genuinely care about AI not killing everyone, and have been willing to actually back original research ideas that don't fit into existing paradigms.
Side note:
Previous posts have been met with great reception by the likes of Eliezer Yudkowsky and Emmett Shear, so we’re up to something good.
This might be a joke, but just in case it's not: I don't think you should reason about your own alignment research agenda like this. I think Eliezer would probably be the first person to tell you that.
I wouldn't expect UTM switching to be able to express arbitrary conditioning; that wouldn't make sense, since conditioning can exclude TMs outright, while every UTM can still express every TM. But that doesn't strike me as the sort of conditioning that prior knowledge of the internet would impose?
Actually, now that I think about it, I guess it could be.