I went through and updated my 2022 “Intro to Brain-Like AGI Safety” series. If you already read it, no need to do so again, but in case you’re curious for details, I put changelogs at the bottom of each post. For a shorter summary of major changes, see this twitter thread, which I copy below (without the screenshots & links):
I’ve learned a few things since writing “Intro to Brain-Like AGI safety” in 2022, so I went through and updated it! Each post has a changelog at the bottom if you’re curious. Most changes were in one of the following categories: (1/7)
REDISTRICTING! As I previously posted ↓, I booted the pallidum out of the “Learning Subsystem”. Now it’s the cortex, striatum, & cerebellum (defined expansively, including amygdala, hippocampus, lateral septum, etc.) (2/7)
LINKS! I wrote 60 posts since first finishing that series. Many of them elaborate and clarify things I hinted at in the series. So I tried to put in links where they seemed helpful. For example, I now link my “Valence” series in a bunch of places. (3/7)
NEUROSCIENCE! I corrected or deleted a bunch of speculative neuro hypotheses that turned out wrong. In some early cases, I can’t even remember wtf I was ever even thinking! Just for fun, here’s the evolution of one of my main diagrams since 2021: (4/7)
EXAMPLES! It never hurts to have more examples! So I added a few more. I also switched the main running example of Post 13 from “envy” to “drive to be liked / admired”, partly because I’m no longer even sure envy is related to social instincts at all (oops) (5/7)
LLMs! … …Just kidding! LLMania has exploded since 2022 but remains basically irrelevant to this series. I hope this series is enjoyed by some of the six remaining AI researchers on Earth who don’t work on LLMs. (I did mention LLMs in a few more places though ↓ ) (6/7)
If you’ve already read the series, no need to do so again, but I want to keep it up-to-date for new readers. Again, see the changelogs at the bottom of each post for details. I’m sure I missed things (and introduced new errors)—let me know if you see any!
I’m intrigued by the reports (including but not limited to the Martin 2020 “PNSE” paper) that people can “become enlightened” and have a radically different sense of self, agency, etc.; but friends and family don’t notice them behaving radically differently, or even differently at all. I’m trying to find sources on whether this is true, and if so, what’s the deal. I’m especially interested in behaviors that (naïvely) seem to centrally involve one’s self-image, such as “applying willpower” or “wanting to impress someone”. Specifically, if there’s a person whose sense-of-self has dissolved / merged into the universe / whatever, and they nevertheless enact behaviors that onlookers would conventionally put into one of those two categories, then how would that person describe / conceptualize those behaviors and why they occurred? (Or would they deny the premise that they are still exhibiting those behaviors?) Interested in any references or thoughts, or email / DM me if you prefer. Thanks in advance!
(Edited to add: Ideally someone would reply: “Yeah I have no sense of self, and also I regularly do things that onlookers describe as ‘applying willpower’ and/or ‘trying to impress someone’. And when that happens, I notice the following sequence of thoughts arising: [insert detailed description]”.)
[also posted on twitter where it got a bunch of replies including one by Aella.]
I’ll give it a go.
I’m not very comfortable with the term enlightened but I’ve been on retreats teaching non-dual meditation, received ‘pointing out instructions’ in the Mahamudra tradition and have experienced some bizarre states of mind where it seemed to make complete sense to think of a sense of awake awareness as being the ground thing that was being experienced spontaneously, with sensations, thoughts and emotions appearing to it — rather than there being a separate me distinct from awareness that was experiencing things ‘using my awareness’, which is how it had always felt before.
When I have (or rather awareness itself has) experienced clear and stable non-dual states the normal ‘self’ stuff still appears in awareness and behaves fairly normally (e.g. there’s hunger, thoughts about making dinner, impulses to move the body, the body moving around the room making dinner…). Being in that non-dual state seemed to add a very pleasant quality of effortlessness and okayness to the mix but beyond that it wasn’t radically changing what the ‘small self’ in awareness was doing.
If later the thought “I want to eat a second portion of ice cream” came up followed by “I should apply some self control. I better not do that.” they would just be things appearing to awareness.
Of course another thing in awareness is the sense that awareness is aware of itself and the fact that everything feels funky and non-dual at the moment. You’d think that might change the chain of thoughts about the ‘small self’ wanting ice cream and then having to apply self control towards itself.
In fact the first few times I had intense non-dual experiences there was a chain of thoughts that went “what the hell is going on? I’m not sure I like this? What if I can’t get back into the normal dualistic state of mind?” followed by some panicked feelings and then the non-dual state quickly collapsing into a normal dualistic state.
With more practice, doing other forms of meditation to build a stronger base of calmness and self-compassion, I was able to experience the non-dual state and the chain of thoughts that appeared would go more like “This time let’s just stick with it a bit longer. Basically no one has a persistent non-dual experience that lasts forever. It will collapse eventually whether you like it or not. Nothing much has really changed about the contents of awareness. It’s the same stuff just from a different perspective. I’m still obviously able to feel calmness and joyfulness, I’m still able to take actions that keep me safe — so it’s fine to hang out here”. And then thoughts eventually wander around to ice cream or whatever. And, again, all this is just stuff appearing within a single unified awake sense of awareness that’s being labelled as the experiencer (rather than the ‘I’ in the thoughts above being the experiencer).
The fact that thoughts referencing the self are appearing in awareness whilst it’s awareness itself that feels like the experiencer doesn’t seem to create as many contradictions as you would expect. I presume that’s partly because awareness itself is able to be aware of its own contents but not do much else. It doesn’t, for example, make decisions or have a sense of free will like the normal dualistic self. Those again would just be more appearances in awareness.
However it’s obvious that awareness being spontaneously aware of itself does change things in important and indirect ways. It does change the sequences of thoughts somehow and the overall feeling tone — and therefore behaviour. But perhaps in less radical ways than you would expect. For me, at different times, this ranged from causing a mini panic attack that collapsed the non-dual state (obviously would have been visible from the outside) to subtly imbuing everything with nice effortlessness vibes and taking the sting out of suffering type experiences but not changing my thought chains and behaviour enough to be noticeable from the outside to someone else.
Disclaimer: I felt unsure at several points writing this and I’m still quite new to non-dual experiences. I can’t reliably generate a clear non-dual state on command, it’s rather hit and miss. What I wrote above is written from a fairly dualistic state relying on memories of previous experiences a few days ago. And it’s possible that the non-dual experience I’m describing here is still rather shallow and missing important insights versus what very accomplished meditators experience.
I won't claim that I'm constantly in a state of non-self, but as I'm writing this, I don't really feel that I'm locally existing in my body. I'm rather the awareness of everything that continuously arises in consciousness.
This doesn't happen all the time, I won't claim to be enlightened or anything but maybe this n=1 self-report can help?
Even from this state of awareness, there's still a will to do something. It is almost like you're a force of nature moving forward with doing what you were doing before you were in a state of presence awareness. It isn't you and at the same time it is you. Words are honestly quite insufficient to describe the experience, but if I try to conceptualise it, I'm the universe moving forward by itself. In a state of non-duality, the taste is often very much the same no matter what experience is arising.
There are some times when I'm not fully in a state of non-dual awareness when it can feel like "I" am pretending to do things. At the same time it also kind of feels like using a tool? The underlying motivation for action changes to something like acceptance or helpfulness, and in order to achieve that, there's this tool of the self that you can apply.
I'm noticing it is quite hard to introspect and try to write from a state of presence awareness at the same time but hopefully it was somewhat helpful?
Could you give me some experiments to try from a state of awareness? I would be happy to try them out and come back.
Extra (relation to some of the ideas): In the Mahayana wisdom tradition, explored in Rob Burbea's Seeing That Frees, there's this idea of emptiness, which is very related to the idea of non-dual perception. For all you see is arising from your own constricted view of experience, and so it is all arising in your own head. Realising this co-creation can enable a freedom of interpretation of your experiences.
Yet this view is also arising in your mind, and so you have "emptiness of emptiness," meaning that you're left without a basis. Therefore, both non-self and self are false but magnificent ways of looking at the world. Some people believe that the non-dual is better than the dual, yet as my Thai Forest tradition guru Ajhan Buddhisaro says, "Don't poopoo the mind." The self boundary can be both a restricting and a very useful concept; it is just very nice to have the skill to see past it and go back to the state of now, of presence awareness.
Emptiness is a bit like deeply seeing that our beliefs are built up from different axioms and being able to say that the axioms of reality aren't based on anything but probabilistic beliefs. Or seeing that we have Occam's razor because we have seen it work before, yet that it is fundamentally completely arbitrary and that the world just is arising spontaneously from moment to moment. Yet Occam's razor is very useful for making claims about the world.
I'm not sure if that connection makes sense, but hopefully, that gives a better understanding of the non-dual understanding of the self and non-self. (At least the Thai Forest one)
Many helpful replies! Here’s where I’m at right now (feel free to push back!) [I’m coming from an atheist-physicalist perspective; this will bounce off everyone else.]
Hypothesis:
Normies like me (Steve) have an intuitive mental concept “Steve” which is simultaneously BOTH (A) Steve-the-human-body-etc AND (B) Steve-the-soul / consciousness / wellspring of vitalistic force / what Dan Dennett calls a “homunculus” / whatever.
The (A) & (B) “Steve” concepts are the same concept in normies like me, or at least deeply tangled together. So it’s hard to entertain the possibility of them coming apart, or to think through the consequences if they do.
Some people can get into a Mental State S (call it a form of “enlightenment”, or pick your favorite terminology) where their intuitive concept-space around (B) radically changes—it broadens, or disappears, or whatever. But for them, the (A) mental concept still exists and indeed doesn’t change much.
Anyway, people often have thoughts that connect sense-of-self to motivation, like “not wanting to be embarrassed” or “wanting to keep my promises”. My central claim is that the relevant sense-of-self involved in that motivation is (A), not (B).
If we conflate (A) & (B)—as normies like me are intuitively inclined to do—then we get the intuition that a radical change in (B) must have radical impacts on behavior. But that’s wrong—the (A) concept is still there and largely unchanged even in Mental State S, and it’s (A), not (B), that plays a role in those behaviorally-important everyday thoughts like “not wanting to be embarrassed” or “wanting to keep my promises”. So radical changes in (B) would not (directly) have the radical behavioral effects that one might intuitively expect (although they do of course have more than zero behavioral effect, with self-reports being an obvious example).
End of hypothesis. Again, feel free to push back!
Some meditators say that before you can get a good sense of non-self you first have to have good self-confidence. I think I would tend to agree with them as it is about how you generally act in the world and what consequences your actions will have. Without this the support for the type B that you're talking about can be very hard to come by.
Otherwise I do really agree with what you say in this comment.
There is a slight disagreement with the elaboration though, I do not actually think that makes sense. I would rather say that the (A) that you're talking about is more of a software construct than it is a hardware construct. When you meditate a lot, you realise this and get access to the full OS instead of just the specific software or OS emulator. A is then an evolutionarily beneficial algorithm that runs a bit out of control (for example during childhood when we attribute all cause and effect to our "selves").
Meditation allows us to see that what we have previously attributed to the self was flimsy and dependent on us believing that the hypothesis of the self is true.
My experience is different from the two you describe. I typically fully lack (A)[1], and partially lack (B). I think this is something different from what others might describe as 'enlightenment'.
I might write more about this if anyone is interested.
At least the 'me-the-human-body' part of the concept. I don't know what the '-etc' part refers to.
I just made a wording change from:
Normies like me have an intuitive mental concept “me” which is simultaneously BOTH (A) me-the-human-body-etc AND (B) me-the-soul / consciousness / wellspring of vitalistic force / what Dan Dennett calls a “homunculus” / whatever.
to:
Normies like me (Steve) have an intuitive mental concept “Steve” which is simultaneously BOTH (A) Steve-the-human-body-etc AND (B) Steve-the-soul / consciousness / wellspring of vitalistic force / what Dan Dennett calls a “homunculus” / whatever.
I think that’s closer to what I was trying to get across. Does that edit change anything in your response?
At least the 'me-the-human-body' part of the concept. I don't know what the '-etc' part refers to.
The “etc” would include things like the tendency for fingers to reactively withdraw from touching a hot surface.
Elaborating a bit: In my own (physicalist, illusionist) ontology, there’s a body with a nervous system including the brain, and the whole mental world including consciousness / awareness is inextricably part of that package. But in other people’s ontology, as I understand it, some nervous system activities / properties (e.g. a finger reactively withdrawing from pain, maybe some or all other desires and aversions) get lumped in with the body, whereas other [things that I happen to believe are] nervous system activities / properties (e.g. awareness) get peeled off into (B). So I said “etc” to include all the former stuff. Hopefully that’s clear.
(I’m trying hard not to get sidetracked into an argument about the true nature of consciousness—I’m stating my ontology without defending it.)
I think that’s closer to what I was trying to get across. Does that edit change anything in your response?
No.
Overall, I would say that my self-concept is closer to what a physicalist ontology implies is mundanely happening - a neural network, lacking a singular 'self' entity inside it, receiving sense data from sensors and able to output commands to this strange, alien vessel (body). (And also I only identify myself with some parts of the non-mechanistic-level description of what the neural network is doing).
I write in a lot more detail below. This isn't necessarily written at you in particular, or with the expectation of you reading through all of it.
1. Non-belief in self-as-body (A)
I see two kinds of self-as-body belief. The first is looking in a mirror, or at a photo, and thinking, "that [body] is me." The second is controlling the body, and having a sense that you're the one moving it, or more strongly, that it is moving because it is you (and you are choosing to move).
I'll write about my experiences with the second kind first.
The way a finger automatically withdraws from heat does not feel like a part of me in any sense. Yesterday, I accidentally dropped a utensil and my hands automatically snapped into place around it somehow, and I thought something like, "woah, I didn't intend to do that. I guess it's a highly optimized narrow heuristic, from times where reacting so quickly was helpful to survival".
I experimented a bit between writing this, and I noticed one intuitive view I can have of the body is that it's some kind of machine that automatically follows such simple intents about the physical world (including intents that I don't consider 'me', like high fear of spiders). For example, if I have motivation and intent to open a window, then the body just automatically moves to it and opens it without me really noticing that the body itself (or more precisely, the body plus the non-me nervous/neural structure controlling it) is the thing doing that - it's kind of like I'm a ghost (or abstract mind) with telekinesis powers (over nearby objects), but then we apply reductive physics and find that actually there's a causal chain beneath the telekinesis involving a moving body (which I always know and can see, I just don't usually think about it).
The way my hands are moving on the keyboard as I write this also doesn't particularly feel like it's me doing that; in my mind, I'm just willing the text to be written, and then the movement happens on its own, in a way that feels kind of alien if I actually focus on it (as if the hands are their own life form).
That said, this isn't always true. I do have an 'embodied self-sense' sometimes. For example, I usually fall asleep cuddling stuffies because this makes me happy. At least some purposeful form of sense-of-embodiment seems present there, because the concept of cuddling has embodiment as an assumption.[1]
(As I read over the above, I wonder how different it really is from normal human experience. I'm guessing there's a subtle difference between "being so embodied it becomes a basic implicit assumption that you don't notice" and "being so nonembodied that noticing it feels like [reductive physics metaphor]")
As for the first kind mentioned, of locating oneself in the body's appearance, which informs typical humans' perception of others and themselves - I don't experience this with regard to myself (and try to avoid being biased about others this way), instead I just feel pretty dissociated when I see my body reflected and mostly ignore it.
In the past, it instead felt actively stressful/impossible/horrifying, because I had (and to an extent still do have) a deep intuition that I am already a 'particular kind of being', and, under the self-as-body ontology, this is expected to correspond to a particular kind of body, one which I did not observe reflected back. As this basic sense-of-self violation happened repeatedly, it gradually eroded away this aspect of sense-of-self / the embodied ontology.
I'd also feel alienated if I had to pilot an adult body to interact with others, so I've set up my life such that I only minimally need to do that (e.g. for doctors' appointments) and can otherwise just interact with the world through text.
2. What parts of the mind-brain are me, and what am I? (B)
I think there's an extent to which I self-model as an 'inner homunculus', or a 'singular-self inside'. I think it's lesser and not as robust in me as it is in typical humans, though. For example, when I reflect on this word 'I' that I keep using, I notice it has a meaning that doesn't feel very true of me: the meaning of a singular, unified entity, rather than multiple inner cognitive processes, or no self in particular.
I often notice my thoughts are coming from different parts of the mind. In one case, I was feeling bad about not having been productive enough in learning/generating insights and I thought to myself, "I need to do better", and then felt aware that it was just one lone part thinking this while the rest doesn't feel moved; the rest instead culminates into a different inner-monologue-thought: something like, "but we always need to do better. tsuyoku naratai is a universal impetus." (to be clear, this is not from a different identity or character, but from different neural processes causally prior to what is thought (or written).)
And when I'm writing (which forces us to 'collapse' our subverbal understanding into one text), it's noticeable how much a potential statement is endorsed by different present influences[2].
I tend to use words like 'I' and 'me' in writing to not confuse others (internally, 'we' can feel more fitting, referring again to multiple inner processes[2], and not to multiple high-level selves as some humans experience. 'we' is often naturally present in our inner monologue). We'll use this language for most of the rest of the text[3].
There are times where this is less true. Our mind can return to acting as a human-singular-identity-player in some contexts. For example, if we're interacting with someone or multiple others, that can push us towards performing a 'self' (but unless it's someone we intuitively-trust and relatively private, we tend to feel alienated/stressed from this). Or if we're, for example, playing a game with a friend, then in those moments we'll probably be drawn back into a more childlike humanistic self-ontology rather than the dissociated posthumanism we describe here.
Also, we want to answer "what inner processes?" - there's some division between parts of the mind-brain we refer to here, and parts that are the 'structure' we're embedded in. We're not quite sure how to write down the line, and it might be fuzzy or e.g. contextual.[4]
3. Tracing the intuitive-ontology shift
"Why are you this way, and have you always been this way?" – We haven't always. We think this is the result of a gradual erosion of the 'default' human ontology, mentioned once above.
We think this mostly did not come from something like 'believing in physicalism'. Most physicalists aren't like this. Ontological crises may have been part of it, though - independently synthesizing determinism as a child and realizing it made naive free will impossible sure did make past-child-quila depressed.
We think the strongest sources came from 'intuitive-ontological'[5] incompatibilities, ways the observations seemed to sadly-contradict the platonic self-ontology we started with. Another term for these would be 'survival updates'. This can also include ways one's starting ontology was inadequate to explain certain important observations.
Also, I think that existing so often in a digital-informational context[6], and only infrequently in an analog/physical context, also contributed to eroding the self-as-body belief.
Also, eventually, it wasn't just erosion/survival updates; at some point, I think I slowly started to embrace this posthumanist ontology, too. It feels narratively fitting that I'm now thinking about artificial intelligence and reading LessWrong.
(There is some sense in which maybe, my proclaimed ontology has its source in constant dissociation, which I only don't experience when feeling especially comfortable/safe. I'm only speculating, though - this is the kind of thing that I'd consider leaving out, since I'm really unsure about it, it's at the level of just one of many passing thoughts I'd consider.)
This 'inner processes' phrasing I keep using doesn't feel quite right. Other words that come to mind: considerations? currently-active neural subnetworks? subagents? some kind of neural council metaphor?
(sometimes 'we' feels unfitting too, it's weird, maybe 'I' is for when a self is being more-performed, or when text is less representative of the whole, hard to say)
We tried to point to some rough differences, but realized that the level we mean is somewhere between high-level concepts with words (like 'general/narrow cognition' and 'altruism' and 'biases') and the lowest-level description (i.e. how actual neurons are interacting physically), and that we don't know how to write about this.
We can differentiate between an endorsed 'whole-world ontology' like physicalism, and smaller-scale intuitive ontologies that are more like intuitive frames we seem to believe in, even if when asked we'll say they're not fundamental truths.
The intuitive ontology of the self is particularly central to humans.
Note this was mostly downstream of other factors, not causally prior to them. I don't want anyone to read this and think internet use itself causes body-self incongruence, though it might avoid certain related feedback loops.
Some ultra-short book reviews on cognitive neuroscience
On Intelligence by Jeff Hawkins & Sandra Blakeslee (2004)—very good. Focused on the neocortex - thalamus - hippocampus system, how it's arranged, what computations it's doing, what's the relation between the hippocampus and neocortex, etc. More on Jeff Hawkins's more recent work here.
I Am a Strange Loop by Hofstadter (2007)—I dunno, I didn't feel like I got very much out of it, although it's possible that I had already internalized some of the ideas from other sources. I mostly agreed with what he said. I probably got more out of watching Hofstadter give a little lecture on analogical reasoning (example) than from this whole book.
Consciousness and the Brain by Dehaene (2014)—very good. Maybe I could have saved time by just reading Kaj's review; there wasn't that much more to the book beyond that.
Conscience by Patricia Churchland (2019)—I hated it. I forget whether I thought it was vague / vacuous, or actually wrong. Apparently I have already blocked the memory!
How to Create a Mind by Kurzweil (2012)—Parts of it were redundant with On Intelligence (which I had read earlier), but still worthwhile. His ideas about how brain-computer interfaces are supposed to work (in the context of cortical algorithms) are intriguing; I'm not convinced, hoping to think about it more.
Rethinking Consciousness by Graziano (2019)—A+, see my review here
The Accidental Mind by Linden (2008)—Lots of fun facts. The conceit / premise (that the brain is a kludgy accident of evolution) is kinda dumb and overdone—and I disagree with some of the surrounding discussion—but that's not really a big part of the book, just an excuse to talk about lots of fun neuroscience.
The Myth of Mirror Neurons by Hickok (2014)—A+, lots of insight about how cognition works, especially the latter half of the book. Prepare to skim some sections of endlessly beating a dead horse (as he debunks seemingly endless lists of bad arguments in favor of some aspect of mirror neurons). As a bonus, you get treated to an eloquent argument for the "intense world" theory of autism, and some aspects of predictive coding.
Surfing Uncertainty by Clark (2015)—I liked it. See also SSC review. I think there's still work to do in fleshing out exactly how these types of algorithms work; it's too easy to mix things up and oversimplify when just describing things qualitatively (see my feeble attempt here, which I only claim is a small step in the right direction).
Rethinking Innateness by Jeffrey Elman, Annette Karmiloff-Smith, Elizabeth Bates, Mark Johnson, Domenico Parisi, and Kim Plunkett (1996)—I liked it. Reading Steven Pinker, you get the idea that connectionists were a bunch of morons who thought that the brain was just a simple feedforward neural net. This book provides a much richer picture.
I think there’s a connection between (A) a common misconception in thinking about future AI (that it’s not a huge deal if it’s “only” about as good as humans at most things), and (B) a common misconception in economics (the “Lump Of Labor Fallacy”).
So I started writing a blog post elaborating on that, but got stuck because my imaginary reader is not an economist and kept raising objections that amounted to saying “yeah but the Lump Of Labor Fallacy isn’t actually a fallacy, there really is a lump of labor” 🤦
Anyway, it’s bad pedagogy to explain a possibly-unintuitive thing by relating it to a different possibly-unintuitive thing. Oh well. (I might still try again to finish writing it at some point.)
It matters a lot what specifically it means to be "as good as humans at most things". The vast majority of jobs include both legible, formal tasks and "be a good employee" requirements, much more nebulous and difficult to measure. Being just as good as the median employee at the formal job description, without the flexibility and trust from being a functioning member of society is NOT enough to replace most workers. It'll replace some, of course.
That said, the fact that "lump of labor" IS a fallacy, and there's not a fixed amount of work to be done, which more workers simply spread more thinly, means that it's OK if it displaces many workers - there will be other things they can valuably do.
By that argument, human-level AI is effectively just immigration.
Being just as good as the median employee at the formal job description, without the flexibility and trust from being a functioning member of society is NOT enough to replace most workers. It'll replace some, of course.
Yup, the context was talking about future AI which can e.g. have the idea that it will found a company, and then do every step to make that happen, and it can do that about as well as the best human (but not dramatically better than the best human).
I definitely sometimes talk to people who say “yes, I agree that that scenario will happen eventually, but it will not significantly change the world. AI would still be just another technology.” (As opposed to “…and then obviously 99.99…% of future companies will be founded by autonomous AIs, because if it becomes possible to mass-produce Jeff Bezos-es by the trillions, then that’s what will happen. And similarly in every other aspect of the economy.”)
By that argument, human-level AI is effectively just immigration.
I think “the effective global labor pool increases by a factor of 1000, consisting of 99.9% AIs” is sometimes a useful scenario to bring up in conversation, but it’s also misleading in certain ways. My actual belief is that humans would rapidly have no ability to contribute to the economy in a post-AGI world, for a similar reason as a moody 7-year-old has essentially no ability to contribute to the economy today (in fact, people would pay good money to keep a moody 7-year-old out of their office or factory).
Dear diary...
[this is an experiment in just posting little progress reports as a self-motivation tool.]
1. I have a growing suspicion that I was wrong to lump the amygdala in with the midbrain. It may be learning by the same reward signal as the neocortex. Or maybe not. It's confusing. Things I'm digesting: https://twitter.com/steve47285/status/1314553896057081857?s=19 (and references therein) and https://www.researchgate.net/publication/11523425_Parallels_between_cerebellum-_and_amygdala-dependent_conditioning
2. Speaking of mistakes, I'm also regretting some comments I made a while ago suggesting that the brain doesn't do backpropagation. Maybe that's true in a narrow sense, but Randall O'Reilly has convinced me that the brain definitely does error-driven learning sometimes (I already knew that), and moreover it may well be able to propagate errors through at least one or two layers of a hierarchy, with enough accuracy to converge. No that doesn't mean that the brain is exactly the same as a PyTorch / Tensorflow Default-Settings Deep Neural Net.
3. My long work-in-progress post on autism continues to be stuck on the fact that there seem to be two theories of social impairment that are each plausible and totally different. In one theory, social interactions are complex and hard to follow / model for cognitive / predictive-model-building reasons. The evidence I like for that is the role of the cerebellum, which sounds awfully causally implicated in autism. Like, absence of a cerebellum can cause autism, if I'm remembering right. In the other theory, modeling social interactions in the neurotypical way (via empathy) is aversive. The evidence I like for that is people with autism self-reporting that eye contact is aversive, among other things. (This is part of "intense world theory".) Of those two stories, I'm roughly 100% sold that the latter story is right. But the former story doesn't seem obviously wrong, and I don't like having two explanations for the same thing (although it's not impossible: autism involves different symptoms in different people, and they could co-occur for biological reasons rather than computational reasons). I'm hoping that the stories actually come together somehow, and I'm just confused about what the cerebellum and amygdala do. So I'm reading and thinking about that.
4. New theory I'm playing with: the neocortex outputs predictions directly, in addition to motor commands. E.g. "my arm is going to be touched". Then the midbrain knows not to flinch when someone touches the arm. That could explain why the visual cortex talks to the superior colliculus, which I always thought was weird. Jeff Hawkins says those connections are the neocortex sending out eye movement motor commands, but isn't that controlled by the frontal eye fields? Oh, then Randall O'Reilly had this mysterious throwaway comment in a lecture that the frontal eye fields seem to be at the bottom of the visual hierarchy if you look at the connections. (He had a reference, I should read it.) I don't know what the heck is going on.
modeling social interactions in the neurotypical way (via empathy) is aversive
Is it too pessimistic to assume that people mostly model other people in order to manipulate them better? I wonder how much of human mental inconsistency is a defense against modeling. Here on Less Wrong we complain that inconsistent behavior makes you vulnerable to Dutch-booking, but in real life, consistent behavior probably makes you even more vulnerable, because your enemies can easily predict what you do and plan accordingly.
I was just writing about my perspective here; see also Simulation Theory (the opposite of "Theory Theory", believe it or not!). I mean, you could say that "making friends and being nice to them" is a form of manipulation, in some technical sense, blah blah evolutionary game theory blah blah, I guess. That seems like something Robin Hanson would say :-P I think it's a bit too cynical if you mean "manipulation" in the everyday sense involving bad intent. Also, if you want to send out vibes of "Don't mess with me or I will crush you!" to other people—and the ability to make credible threats is advantageous for game-theory reasons—that's all about being predictable and consistent!
Again, as I posted just now, I think the lion's share of "modeling", as I'm using the term, is something that happens unconsciously in a fraction of a second, not effortful empathy or modeling.
Hmmm... If I'm trying to impress someone, I do indeed effortfully try to develop a model of what they're impressed by, and then use that model when talking to them. And I tend to succeed! And it's not all that hard! The most obvious strategy tends to work (i.e., go with what has impressed them in the past, or what they say would be impressive, or what impresses similar people). I don't really see any aspect of human nature that is working to make it hard for me to impress someone, like by a person randomly changing what they find impressive. Do you? Are there better examples?
I have low confidence debating this, because it seems to me like many things could be explained in various ways. For example, I agree that certain predictability is needed to prevent people from messing with you. On the other hand, certain uncertainty is needed, too -- if people know exactly when you would snap and start crushing them, they will go 5% below the line; but if the exact line depends on what you had for breakfast today, they will be more careful about getting too close to it.
Branding: 3 reasons why I prefer "AGI safety" to "AI alignment"
When engineers, politicians, bureaucrats, military leaders, etc. hear the word "safety", they suddenly perk up and start nodding and smiling. Safety engineering—making sure that systems robustly do what you want them to do—is something that people across society can relate to and appreciate. By contrast, when people hear the term "AI alignment" for the first time, they just don't know what it means or how to contextualize it.
There are a lot of things that people are working on in this space that aren't exactly "alignment"—things like boxing, task-limited AI, myopic AI, impact-limited AI, non-goal-directed AI, AGI strategy & forecasting, etc. It's useful to have a term that includes all those things, and I think that term should be "AGI safety". Then we can reserve "AI alignment" for specifically value alignment.
Actually, I'm not even sure that "value alignment" is exactly the right term for value alignment. The term "value alignment" is naturally read as something like "the AI's values are aligned with human values", which isn't necessarily wrong, but is a bit vague and not necessarily interpreted correctly. For example, if love is a human value, should the AGI adopt that value and start falling in love? No, it should facilitate humans falling in love. When people talk about CIRL, CEV, etc. it seems to be less about "value alignment" and more about "value indirection" (in the C++ sense), i.e. utility functions that involve human goals and values, and which more specifically define those things by pointing at human brains and human behavior.
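To make the "indirection" idea a bit more concrete, here is a toy sketch in Python (not C++, just for brevity). Everything in it is a made-up illustration: the `Human` class, the two utility constructors, and the numeric weights are assumptions, not anything taken from CIRL or CEV. It only contrasts an agent that copies a snapshot of the human's values with one that holds a reference and looks the values up each time.

```python
from dataclasses import dataclass, field


@dataclass
class Human:
    """Hypothetical stand-in for 'whatever this human actually values'."""
    values: dict = field(default_factory=dict)


def copied_utility(human):
    """Naive reading: snapshot the values once and use the copy forever."""
    snapshot = dict(human.values)
    return lambda outcome: sum(w * outcome.get(k, 0.0) for k, w in snapshot.items())


def indirect_utility(human):
    """Indirection: keep a reference and dereference it at evaluation time."""
    return lambda outcome: sum(w * outcome.get(k, 0.0) for k, w in human.values.items())


alice = Human({"romance": 1.0})
u_copy = copied_utility(alice)
u_ref = indirect_utility(alice)

# The human's values change after the agent was built.
alice.values["solar_cells"] = 0.5

outcome = {"romance": 1.0, "solar_cells": 2.0}
score_copy = u_copy(outcome)  # 1.0: still scoring against the stale snapshot
score_ref = u_ref(outcome)    # 2.0: follows the pointer to the current values
```

The point of the toy is only that the second utility function is defined by where it points, not by what was there when it was written.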
A friend in the AI space who visited Washington told me that military leaders distinctly do not like the term "safety".
Right, for them "alignment" could mean their desired concept, "safe for everyone except our targets".
I'm skeptical that anyone with that level of responsibility and acumen has that kind of juvenile destructive mindset. Can you think of other explanations?
Can you think of other explanations?
There's a difference between people talking about safety in the sense of 1. 'how to handle a firearm safely' and the sense of 2. 'firearms are dangerous, let's ban all guns'. These leaders may understand/be on board with 1, but disagree with 2.
I think if someone negatively reacts to 'Safety' thinking you mean 'try to ban all guns' instead of 'teach good firearm safety', you can rephrase as 'Control' in that context. I think Safety is more inclusive of various aspects of the problem than either 'Control' or 'Alignment', so I like it better as an encompassing term.
Interesting. I guess I was thinking specifically about DARPA which might or might not be representative, but see Safe Documents, Safe Genes, Safe Autonomy, Safety and security properties of software, etc. etc.
In the era of COVID, we should all be doing cardio exercise if possible, and not at a gym. Here's what's been working for me for the past many years. This is not well optimized for perfectly working out every muscle group etc., but it is very highly optimized for convenience, practicality, and sustainability, at least for me personally in my life situation.
(This post is mostly about home cardio exercise, but the last paragraph is about jogging.)
My home exercise routine consists of three simultaneous things: {exercise, YouTube video lectures, RockMyRun}. More on the exercise below. RockMyRun is a site/app that offers music mixes at fixed BPMs; the music helps my energy and the fixed BPM keeps me from gradually slowing down the pace. The video lectures make me motivated to work out, since there's a lot of stuff I desperately want to learn. :)
Previously I've done instead {exercise, movies or TV}. (I still do on rare occasions.) This is motivating when combined with the rule of "no movies or TV unless exercising (or on social special occasions)". I've pretty much followed that rule for years now.
My exercise routine consists of holding a dumbbell in each hand, then doing a sort of simultaneous reverse-lunge while lifting one of the dumbbells, alternating sides, kinda like this picture. Out of numerous things I've tried, this is the one that stuck, because it's compatible with watching TV, compatible with very small spaces including low ceilings, has low risk of injury, doesn't stomp or make noise, doesn't require paying attention (once you get the hang of it), and seems to be a pretty good cardio workout (as judged by being able to break a sweat in a freezing cold room). I also do a few pushups now and then as a break, although that means missing what's on the screen. I've gradually increased the dumbbell weight over the years from 3lbs (1.4kg) to now 15lbs (7kg).
I strongly believe that the top priority for an exercise routine is whatever helps you actually keep doing it perpetually. But beyond that, I've found some factors that give me a more intense workout: Coffee helps slightly (it's a performance-enhancing drug! At least for some people); feeling cold at the beginning / being in a cold room seems to help; awesome action-packed movies or TV are a nice boost, but RockMyRun with boring video lectures is good enough. (My most intense workouts are watching music videos or concert recordings, but I get bored of those after a while.)
In other news, I also occasionally jog. RockMyRun is also a really good idea for that, not just for the obvious reasons (energy, pace), but because, when you set the BPM high, your running form magically and effortlessly improves. This completely solved my jogging knee pain problems, which I had struggled with for years. (I learned that tip from here, where he recommends 160BPM. I personally prefer 180BPM, because I like shorter and more intense runs for my time-crunched schedule.)
Quick comments on "The case against economic values in the brain" by Benjamin Hayden & Yael Niv:
(I really only skimmed the paper, these are just impressions off the top of my head.)
I agree that "eating this sandwich" doesn't have a reward prediction per se, because there are lots of different ways to think about eating this sandwich, especially what aspects are salient, what associations are salient, what your hormones and mood are, etc. If neuroeconomics is premised on reward predictions being attached to events and objects rather than thoughts, then I don't like neuroeconomics either, at least not as a mechanistic theory of psychology. [I don't know anything about neuroeconomics, maybe that was never the idea anyway.]
But when they float the idea of throwing out rewards altogether, I'm not buying it. The main reason is: I'm trying to understand what the brain does algorithmically, and I feel like I'm making progress towards a coherent picture ...and part of that picture is a 1-dimensional signal called reward. If you got rid of that, I just have no idea how to fill in that gap. Doesn't mean it's impossible, but I did try to think it through and failed.
There's also a nice biological story going with the algorithm story: the basal ganglia has a dense web of connections across the frontal lobe, and can just memorize "this meaningless set of neurons firing is associated with that reward, and this meaningless set of neurons firing is associated with that reward, etc. etc." Then it (1) inhibits all but the highest-reward-predicting activity, and (2) updates the reward predictions based on what happens (TD learning). (Again this and everything else is very sketchy and speculative.)
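Here is a minimal sketch of that two-step loop, in the same sketchy and speculative spirit. The names `select` and `td_update`, the learning rate `alpha`, the discount `gamma`, and the use of frozensets of neuron indices as "meaningless activity patterns" are all my illustrative assumptions; this is a plain tabular TD(0) update, not a claim about what the basal ganglia literally computes.

```python
def select(candidates, predicted_reward):
    """Step (1): suppress everything except the highest-reward-predicting pattern."""
    return max(candidates, key=lambda c: predicted_reward.get(c, 0.0))


def td_update(predicted_reward, pattern, reward, next_pattern=None,
              alpha=0.1, gamma=0.9):
    """Step (2): tabular TD(0) update of the reward prediction for `pattern`."""
    v = predicted_reward.get(pattern, 0.0)
    # A missing next pattern is treated as the end of the episode (value 0).
    v_next = predicted_reward.get(next_pattern, 0.0) if next_pattern is not None else 0.0
    td_error = reward + gamma * v_next - v
    predicted_reward[pattern] = v + alpha * td_error
    return td_error


# Two arbitrary activity patterns, identified only by which neurons fire.
pattern_a = frozenset({3, 17, 42})
pattern_b = frozenset({5, 8})

predicted_reward = {}
for _ in range(50):
    td_update(predicted_reward, pattern_a, reward=1.0)
    td_update(predicted_reward, pattern_b, reward=0.2)

winner = select([pattern_a, pattern_b], predicted_reward)
```

After the loop, the table has simply memorized that `pattern_a` tends to precede more reward than `pattern_b`, so `select` picks `pattern_a`, without the table "knowing" what either pattern means.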
(DeepMind had a paper that says there's a reward prediction probability distribution instead of a reward prediction value, which is fine, that's still consistent with the rest of my story.)
I get how deep neural nets can search for a policy directly. I don't think those methods are consistent with the other things I believe about the brain (or at least the neocortex). In particular I think the brain does seem to have a mechanism for choosing among different possible actions being considered in parallel, as opposed to a direct learned function from sensory input to output. The paper also mentions learning to compare without learning a value, but I don't think that works because there are too many possible comparisons (the square of the number of possible thoughts).
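To spell out the counting argument, here is a tiny back-of-the-envelope sketch. The function names and the example sizes are arbitrary assumptions; it just compares one stored value per thought against one stored comparison per ordered pair of thoughts.

```python
def num_values(n_thoughts: int) -> int:
    """One reward prediction per possible thought."""
    return n_thoughts


def num_pairwise_comparisons(n_thoughts: int) -> int:
    """One learned comparison per ordered pair of possible thoughts."""
    return n_thoughts ** 2


for n in (10, 1_000, 1_000_000):
    print(n, num_values(n), num_pairwise_comparisons(n))
```

Counting unordered pairs instead would roughly halve the second number, but the growth is still quadratic.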
Introducing AGI Safety in general, and my research in particular, to novices / skeptics, in 5 minutes, out loud
I might be interviewed on a podcast where I need to introduce AGI risk to a broad audience of people who mostly aren’t familiar with it and/or think it’s stupid. The audience is mostly neuroscientists plus some AI people. I wrote the following as a possible entry-point, if I get thrown some generic opening question like “Tell me about what you’re working on”:
The human brain does all these impressive things, such that humanity was able to transform the world, go to the moon, invent nuclear weapons, wipe out various species, etc. Human brains did all those things by running certain algorithms.
And sooner or later, people will presumably figure out how to run similar algorithms on computer chips.
Then what? That’s the million-dollar question. Then what? What happens when researchers eventually get to the point where they can run human-brain-like algorithms on computer chips?
OK, to proceed I need to split into two ways of thinking about these future AI systems: Like a tool or like a species.
Let's start with the tool perspective. Here I'm probably addressing the AI people in the audience. You're thinking, “Oh, you're talking about AI, well pfft, I know what AI is, I work with AI every day, AI is kinda like language models and ConvNets and AlphaFold and so on. By the time we get future algorithms that are more like how the human brain works, they're going to be more powerful, sure, but we should still think of them as in the same category as ConvNets, we should think of them like a tool that people will use.” OK, if that's your perspective, then the goal is for these tools to do the things that we want them to do. And conversely, the concern is that these systems could go about doing things that the programmers didn't want them to do, and that literally nobody wanted them to do, like try to escape human control. The technical problem here is called The Alignment Problem: If people figure out how to run human-brain-like algorithms on computer chips, and they want those algorithms to try to do X, how can they do that? It's not straightforward. For example, humans have an innate sex drive, but it doesn't work very reliably, some people choose to be celibate. OK, so imagine you have the source code for a human-like brain architecture and training environment, and you want it to definitely grow into an adult that really, deeply, wants to do some particular task, like let's say design solar cells, while also being honest and staying under human control. How would you do that? What exactly would you put into the source code? Nobody knows the answer. And when you dig into it you find that it's a surprisingly tricky technical problem, for pretty deep reasons. And that technical problem is something that I and others in the field are working on.
That was the tool perspective. But then there's probably another part of the audience, maybe a lot of the neuroscientists, who are strenuously objecting here: if we run human-brain-like algorithms on computer chips, we shouldn't think of that as like a tool for humans to use, instead we should think of it like a species, a new intelligent species that we have invited onto our planet, and indeed a species which will eventually think much faster than humans, and be more insightful and creative than humans, and also probably eventually outnumber humans by a huge factor, and so on. In that perspective, the question is: if we're going to invite this powerful new intelligent species onto our planet, how do we make sure that it's a species that we actually want to share the planet with? And how do we make sure that they want to continue sharing the planet with us? Or more generally, how do we bring about a good future? There are some interesting philosophy questions here which we can get back to, but putting those aside, there's also a technical problem to solve, which is, whatever properties we want this new intelligent species to have, we need to actually write source code such that that actually happens. For example, if we want this new species to feel compassion and friendship, we gotta put compassion and friendship into the source code. Human sociopaths are a case study here. Sociopaths exist, therefore it is possible to make an intelligent species that isn't motivated by compassion and friendship. Not just possible, but strictly easier! I think maybe future programmers will want to put compassion and friendship into the source code, but they won't know how, so they won't do it. So I say, let’s try to figure that out ahead of time. Again, I claim this is a very tricky technical problem, when you start digging into it. We can talk about why. Anyway, that technical problem is also something that I'm working on.
So in summary, sooner or later people will figure out how to run human-brain-like algorithms on computer chips, and this is a very very big deal, it could be the best or worst thing that's ever happened to humanity, and there's work we can do right now to increase the chance that things go well, including, in particular, technical work that involves thinking about algorithms and AI and reading neuroscience papers. And that's what I'm working on!
I’m open to feedback; e.g., where might skeptical audience-members fall off the boat? (I am aware that it’s too long for one answer; I expect that I’ll end up saying various pieces of this in some order depending on the flow of the conversation. But still, gotta start somewhere.)