Steven Byrnes

I'm an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Research Fellow at Astera. See https://sjbyrnes.com/agi.html for a summary of my research and sorted list of writing. Physicist by training. Email: steven.byrnes@gmail.com. Leave me anonymous feedback here. I’m also at: RSS feed, X/Twitter, Bluesky, Mastodon, Threads, GitHub, Wikipedia, Physics-StackExchange, LinkedIn

Sequences

Intuitive Self-Models
Valence
Intro to Brain-Like-AGI Safety

Comments

Hmm, I think the point I’m trying to make is: it’s dicey to have a system S that’s being continually modified to systematically reduce some loss L, but then we intervene to edit S in a way that increases L. We’re kinda fighting against the loss-reducing mechanism (be it gradient descent or bankroll-changes or whatever), hoping that the loss-reducing mechanism won’t find a “repair” that works around our interventions.
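As a toy sketch of that dynamic (not from the original discussion): the "system S" here is a single number `w`, the loss is a made-up quadratic, and plain gradient descent stands in for the loss-reducing mechanism. All names and constants are hypothetical.

```python
# Toy illustration; every name and number here is an assumption.
# The "system S" is one parameter w, and the loss L is (w - 3)^2.

def loss(w: float) -> float:
    return (w - 3.0) ** 2

def grad(w: float) -> float:
    return 2.0 * (w - 3.0)

def train(w: float, steps: int, lr: float = 0.1) -> float:
    """Plain gradient descent: the loss-reducing mechanism."""
    for _ in range(steps):
        w -= lr * grad(w)
    return w

w = train(0.0, steps=100)             # S settles near the loss minimum
w_edited = w + 5.0                    # our intervention: an edit that increases L
w_after = train(w_edited, steps=100)  # training continues and "repairs" the edit

print(loss(w), loss(w_edited), loss(w_after))
```

In this one-parameter case the repair is trivial; the worry in a real system is the analogous workaround in a much larger parameter space.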

In that context, my presumption is that an AI will have some epistemic part S that’s continually modified to produce correct objective understanding of the world, including correct anticipation of the likely consequences of actions. The loss L for that part would probably be self-supervised learning, but could also include self-consistency or whatever.

And then I’m interpreting you (maybe not correctly?) as proposing that we should consider things like making the AI have objectively incorrect beliefs about (say) bioweapons, and I feel like that’s fighting against this L in that dicey way.

Whereas your Q-learning example doesn’t have any problem with fighting against a loss function, because Q(S,A) is being consistently and only updated by the reward.
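For concreteness, here is a minimal sketch of a standard tabular Q-learning update. This is an assumption about the kind of setup meant; the example under discussion may differ, and the state/action names are hypothetical. The thing to notice is that the only training signal touching `Q` is the reward term.

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    """One tabular Q-learning step: Q(s, a) moves toward r + gamma * max_a' Q(s', a')."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

# Hypothetical two-action world, all values initialized to zero.
Q = defaultdict(float)
ACTIONS = ["left", "right"]
q_update(Q, "s0", "left", r=1.0, s_next="s1", actions=ACTIONS)
print(Q[("s0", "left")])
```

There is no second loss pulling `Q` in a different direction, so nothing is being fought against.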

The above is inapplicable to LLMs, I think. (And this seems tied IMO to the fact that LLMs can’t do great novel science yet etc.) But it does apply to FixDT.

Specifically, for things like FixDT, if there are multiple fixed points (e.g. I expect to stand up, and then I stand up, and thus the prediction was correct), then whatever process you use to privilege one fixed point over another, you’re not fighting against the above L (i.e., the “epistemic” loss L based on self-supervised learning and/or self-consistency or whatever). L is applying no force either way. It’s a wide-open degree of freedom.

(If your response is “L incentivizes fixed-points that make the world easier to predict”, then I don’t think that’s a correct description of what such a learning algorithm would do.)
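A toy sketch of the "no force either way" point, under the assumption of a purely self-fulfilling prediction. The `outcome` function and the squared-error loss are hypothetical stand-ins, not anything from the FixDT formalism itself.

```python
def outcome(p: float) -> float:
    """Hypothetical self-fulfilling world: I stand up iff I expect to."""
    return 1.0 if p > 0.5 else 0.0

def epistemic_loss(p: float) -> float:
    """Squared prediction error of the belief p against what actually happens."""
    return (p - outcome(p)) ** 2

# Both fixed points are perfectly accurate, so the epistemic loss cannot rank them.
for p in (0.0, 1.0):
    print(p, epistemic_loss(p))
```

Whatever picks between `p = 0.0` and `p = 1.0` here has to come from somewhere other than prediction accuracy.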

So if your feedback proposal exclusively involves a mechanism privileging one fixed point over another, then I have no complaints, and would describe it as choosing a utility function (preferences not beliefs) within the FixDT framework.

Btw I think we’re in agreement that there should be some mechanism privileging one fixed point over another, instead of ignoring it and just letting the underdetermined system do whatever it does.

Updating on things being true or false cannot rule out agentic hypotheses (the inner optimizer problem). … Any sufficiently rich hypotheses space has agentic policies, which can't be ruled out by the feedback.

Oh, I want to set that problem aside because I don’t think you need an arbitrarily rich hypothesis space to get ASI. The agency comes from the whole AI system, not just the “epistemic” part, so the “epistemic” part can be selected from a limited model class, as opposed to running arbitrary computations etc. For example, the world model can be “just” a Bayes net, or whatever. We’ve talked about this before.

Reinforcement Learning cannot rule out the wireheading hypothesis or human-manipulation hypothesis.

I also learned the term observation-utility agents from you :) You don’t think that can solve those problems (in principle)?

I’m probably misunderstanding you here and elsewhere, but enjoying the chat, thanks :)

The OP talks about the fact that evolution produced lots of organisms on Earth, of which humans are just one example, and that if we view the set of all life, arguably more of it consists of bacteria or trees than humans. Then this comment thread has been about the question: so what? Why bring that up? Who cares?

Like, here’s where I think we’re at in the discussion:

Nate or Eliezer: “Evolution made humans, and humans don’t care about inclusive genetic fitness.”

tailcalled: “Ah, but did you know that evolution also made bacteria and trees?”

Nate or Eliezer: “…Huh? What does that have to do with anything?”

If you think that the existence on Earth of lots of bacteria and trees is a point that specifically undermines something that Nate or Eliezer said, then can you explain the details?

Here’s a sensible claim:

CLAIM A: “IF there’s a learning algorithm whose reward function is X, THEN the trained models that it creates will not necessarily explicitly desire X.”

This is obviously true, and every animal including humans serves as an example. For most animals, it’s trivially true, because most animals don’t even know what inclusive genetic fitness is, so obviously they don’t explicitly desire it.

So here’s a stronger claim:

CLAIM B: “CLAIM A is true even if the trained model is sophisticated enough to fully understand what X is, and to fully understand that it was itself created by this learning algorithm.”

This one is true too, and I think humans are the only example we have. I mean, the claim is really obvious if you know how algorithms work etc., but of course some people question it anyway, so it can be nice to have a concrete illustration.

(More discussion here.)

Neither of those claims has anything to do with humans being the “winners” of evolution. I don’t think there’s any real alignment-related claim that does. Although, people say all kinds of things, I suppose. So anyway, if there’s really something substantive that this post is responding to, I suggest you try to dig it out.

I’ve been on twitter since 2013 and have only ever used the OG timeline (a.k.a. chronological, a.k.a. “following”, a.k.a. every tweet from the people you follow and no others). I think there were periods where the OG timeline was (annoyingly) pretty hard to find, and there were periods where you would be (infuriatingly) auto-switched out of the OG timeline every now and then (weekly-ish?) and had to manually switch back. The OG timeline also has long had occasional advertisements of course. And you might be right that (in some periods) the OG timeline also included occasional other tweets that shouldn’t be in the OG timeline but were thrown in. IIRC, I thought of those as being in the same general category as advertisements, but just kinda advertisements for using more twitter. I think there was a “see less often” option for those, and I always selected that, and I think that helped maintain the relative purity of my OG timeline.

FWIW I don’t think “self-models” in the Intuitive Self-Models sense are related to instrumental power-seeking—see §8.2. For example, I think of my toenail as “part of myself”, but I’m happy to clip it. And I understand that if someone “identifies with the universal consciousness”, their residual urges towards status-seeking, avoiding pain, and so on are about the status and pain of their conventional selves, not the status and pain of the universal consciousness. More examples here and here.

Separately, I’m not sure what if anything the Intuitive Self-Models stuff has to do with LLMs in the first place.

But there’s a deeper problem: the instrumental convergence concern is about agents that have preferences about the state of the world in the distant future, not about agents that have preferences about themselves. (Cf. here.) So for example, if an agent wants there to be lots of paperclips in the future, then that’s the starting point, and everything else can be derived from there.

  • Q: Does the agent care about protecting “the temporary state of the execution of the model (or models)”?
    • A: Yes, if and only if protecting that state is likely to ultimately lead to more paperclips.
  • Q: Does the agent care about protecting “the compute resources (CPU/GPU/RAM) allocated to run the model and its collection of support programs”?
    • A: Yes, if and only if protecting those resources is likely to ultimately lead to more paperclips.

Etc. See what I mean? That’s instrumental convergence, and self-models have nothing to do with it.

Sorry if I’m misunderstanding.

Thanks for the comment!

people report advanced meditative states that lose many of the common properties of consciousness, including Free Will, the feeling of having a self (I've experienced that one!) and even the presence of any information content whatsoever, and afaik they tend to be more "impressed", roughly speaking, with consciousness as a result of those experiences, not less.

I think that’s compatible with my models, because those meditators still have a cortex, in which patterns of neurons can be firing or not firing at any particular time. And that’s the core aspect of the “territory” which corresponds to “conscious awareness” in the “map”. No amount of meditation, drugs, etc., can change that.

Attempt to rephrase: the brain has several different intuitive models in different places. These models have different causal profiles, which explains how they can correspond to different introspective reports. 

Hmm, I think that’s not really what I would say. I would say that there’s a concept “conscious awareness” (in the map) that corresponds to the fact (in the territory) that different patterns of neurons can be active or inactive in the cortex at different times. And then there are more specific aspects of “conscious awareness”, like “visual awareness”, which corresponds to the fact that the cortex has different parts (motor cortex etc.), and different patterns of neurons can be active or inactive in any given part of the cortex at different times.

…Maybe this next part will help ↓

the distinction between visually vivid experience and vague intuitions isn't just that we happen to call them by different labels … Claiming to see a visual image is different from claiming to have a vague intuition in all the ways that it's different

The contents of IT are really truly different from the contents of LIP [I didn’t check where the visual information gets to the cortex in blindsight, I’m just guessing LIP for concreteness]. Querying IT is a different operation than querying LIP. IT holds different types of information than LIP does, and does different things with that information, including leading to different visceral reactions, motivations, semantic knowledge, etc., all of which correspond to neuroscientific differences in how IT versus LIP is wired up. 

All these differences between IT vs LIP are in the territory, not the map. So I definitely agree that “the distinction [between seeing and vague-sense-of-presence] isn’t just that we happen to call them by different labels”. They’re different like how the concept “hand” is different from the concept “foot”—a distinction on the map downstream of a distinction in the territory.

Is awareness really a serial processor in any meaningful way if it can contain as much information at once as a visual image seems to contain?

I’m sure you’re aware that people feel like they have a broader continuous awareness of their visual field than they actually do. There are lots of demonstrations of this, e.g. change blindness, selective attention test, the fact that peripheral vision has terrible resolution and terrible color perception and makes faces look creepy. There’s a refrigerator light illusion thing: if X is in my peripheral vision, then maybe it’s currently active as just a little pointer in a tiny sub-area of my cortex, but as soon as I turn my attention to X it immediately unfolds in full detail across the global workspace.

The cortex has 10 billion neurons which is more than enough to do some things in parallel—e.g. I can have a song stuck in my head in auditory cortex, while tapping my foot with motor cortex, while doing math homework with other parts of the cortex. But there’s also a serial aspect to it—you can’t parse a legal document and try to remember your friend’s name at the exact same moment.

Does that help? Sorry if I’m not responding to what you see as most important, happy to keep going.  :)

Thanks for the detailed comment!

Well, post #2 is about conscious awareness so it gets the closest, but you only really talk about how there is a serial processing stream in the brain whose contents roughly correspond to what we claim is in awareness -- which I'd argue is just the coarse functional behavior, i.e., the macro problem. This doesn't seem very related to the hard meta problem because I can imagine either one of the problems not existing without the other. I.e., I can imagine that (a) people do claim to be conscious but in a very different way, and (b) people don't claim to be conscious, but their high-level functional recollection does match the model you describe in the post. And if that's the case, then by definition they're independent. … if you actually ask camp #2 people, I think they'll tell you that the problem isn't really about the macro functional behavior of awareness

The way intuitive models work (I claim) is that there are concepts, and associations / implications / connotations of those concepts. There’s a core intuitive concept “carrot”, and it has implications about shape, color, taste, botanical origin, etc. And if you specify the shape, color, etc. of a thing, and they’re somewhat different from most normal carrots, then people will feel like there’s a question “but now is it really a carrot?” that goes beyond the complete list of its actual properties. But there isn’t, really. Once you list all the properties, there’s no additional unanswered question. It just feels like there is. This is an aspect of how intuitive models work, but it doesn’t veridically correspond to anything of substance.

The old Yudkowsky post “How An Algorithm Feels From Inside” is a great discussion of this point.

So anyway, if “consciousness” has connotations / implications A,B,C,D,E, etc. (it’s “subjective”, it goes away under general anesthesia, it’s connected to memory, etc.), then people will feel like there’s an additional question “but is it really consciousness”, that still needs to be answered, above and beyond the specific properties A,B,C,D,E.

And likewise, if you ask a person “Can you imagine something that lacks A,B,C,D,E, but still constitutes ‘consciousness’”, then they may well say “yeah I can imagine that”. But we shouldn’t take that report to be particularly meaningful.

(…See also Frankish’s “Quining Diet Qualia” (2012).)

Copying the above terminology, we could phrase the hard problem of seeing as explaining why people see images, and the hard meta problem of seeing as explaining why people claim to see images.

As in Post 2, there’s an intuitive concept that I’m calling “conscious awareness” that captures the fact that the cortex has different generative models active at different times. Different parts of the cortex wind up building different kinds of models—S1 builds generative models of somatosensory data, M1 builds generative models of motor programs, and so on. But here I want to talk about the areas in the overlap between the “ventral visual stream” and the “global workspace”, which is mainly in and around the inferior temporal gyrus, “IT”.

When we’re paying attention to what we’re looking at, IT would have some generative model active that optimally balances between (1) priors about the visual world, and (2) the visual input right now. Alternatively, if we’re zoning out from what we’re looking at, and instead using visual imagination or visual memory, then (2) is off (i.e., the active IT model can be wildly incompatible with immediate visual input), but (1) is still relevant, and instead there needs to be consistency between IT and episodic memory areas, or various other possibilities.

So anyway,

  • In the territory, “Model A is currently active in IT” is a very different situation from “Model B is currently active in the superior temporal gyrus” or whatever.
  • Correspondingly, in the map, we wind up with the intuition that “X is in awareness as a vision” is very different from “Y is in awareness as a sound”, and both are very different from “Z is in awareness as a plan”, etc.

You brought up blindsight. That would be where the model “X is in awareness as a vision” seems wrong. That model would entail a specific set of predictions about the state of IT, and it turns out that those predictions are false. However, some other part of awareness is still getting visual information via some other pathway. (Visual information gets into various parts of the cortex via more than one pathway.) So the blindsight patient might describe their experience as “I don’t see anything, but for some reason I feel like there’s motion on the left side”, or whatever. And we can map that utterance into a correct description of what was happening in their cortex.

 

Separately, as for the hard problem of consciousness, you might be surprised to learn that I actually haven’t thought about it much and still find it kinda confusing. I had written something into an early draft of post 1 but wound up deleting it before publication. Here’s what it said:

Start with an analogy to physics. There’s a Stephen Hawking quote I like:

> “Even if there is only one possible unified theory, it is just a set of rules and equations. What is it that breathes fire into the equations and makes a universe for them to describe? The usual approach of science of constructing a mathematical model cannot answer the questions of why there should be a universe for the model to describe. Why does the universe go to all the bother of existing?”

I could be wrong, but Hawking’s question seems to be pointing at a real mystery. But as Hawking says, there seems to be no possible observation or scientific experiment that would shed light on that mystery. Whatever the true laws of physics are in our universe, every possible experiment would just confirm, yup, those are the true laws of physics. It wouldn’t help us figure out what if anything “breathes fire” into those laws. What would progress on the “breathes fire” question even look like?? (See Tegmark’s Mathematical Universe book for the only serious attempt I know of, which I still find unsatisfying. He basically says that all possible laws of the universe have fire breathed into them. But even if that’s true, I still want to ask … why?)

By analogy, I’m tempted to say that an illusionist account can explain every possible experiment about consciousness, including our belief that consciousness exists at all, and all its properties, and all the philosophy books on it, and so on … but yet I’m tempted to still say that there’s some “breathes fire” / “why is there something rather than nothing” type question left unanswered by the illusionist account. This unanswered question should not be called “the hard problem”, but rather “the impossible problem”, in the sense that, just like Hawking’s question above, there seems to be no possible scientific measurement or introspective experiment that could shed light on it: all possible such data, including the very fact that I’m writing this paragraph, are already screened off by the illusionist framework.

Well, hmm, maybe that’s stupid. I dunno.

Thanks!

Do you have any thoughts on why then does psychosis typically suddenly 'kick in' in late adolescence / early adulthood? 

Yeah as I discussed in Schizophrenia as a deficiency in long-range cortex-to-cortex communication Section 4.1, I blame synaptic pruning, which continues into your 20s.

and why trauma correlates with it and tends to act as that 'kickstarter'?

No idea. As for “kickstarter”, my first question is: is that actually true? It might be correlation not causation. It’s hard to figure that out experimentally. That said, I have some discussion of how strong emotions in general, and trauma in particular, can lead to hallucinations (e.g. hearing voices) and delusions via a quite different mechanism in [Intuitive self-models] 7. Hearing Voices, and Other Hallucinations. I’ve been thinking of “psychosis via disjointed cognition” (schizophrenia & mania per this post) and “psychosis via strong emotions” (e.g. trauma, see that other post) as pretty different and unrelated, but I guess it’s maybe possible that there’s some synergy where their effects add up such that someone who is just under the threshold for schizophrenic delusions can get put over the top by strong emotions like trauma.

Also any thoughts about delusions? Like how come schizophrenic people will occasionally not just believe in impossible things but very occasionally even random things like 'I am Jesus Christ' or 'I am Napoleon'?

I talk about that a bit better in the other post:

In the diagram above, I used “command to move my arm” as an example. By default, when my brainstem notices my arm moving unexpectedly, it fires an orienting / startle reflex—imagine having your arm resting on an armrest, and the armrest suddenly starts moving. Now, when it’s my own motor cortex initiating the arm movement, then that shouldn’t be “unexpected”, and hence shouldn’t lead to a startle. However, if different parts of the cortex are sending output signals independently, each oblivious to what the other parts are doing, then a key prediction signal won’t get sent down into the brainstem, and thus the motion will in fact be “unexpected” from the brainstem’s perspective. The resulting suite of sensations, including the startle, will be pretty different from how self-generated motor actions feel, and so it will be conceptualized differently, perhaps as a “delusion of control”.

That’s just one example. The same idea works equally well if I replace “command to move my arm” with “command to do a certain inner speech act”, in which case the result is an auditory hallucination. Or it could be a “command to visually imagine something”, in which case the result is a visual hallucination. Or it could be some visceromotor signal that causes physiological arousal, perhaps leading to a delusion of reference, and so on.

So, I dunno, imagine that cortex area 1 is a visceromotor area saying “something profoundly important is happening right now!” for some random reason, and independently, cortex area 2 is saying “who am I?”, and independently, cortex area 3 is saying “Napoleon”. All three of these things are happening independently and unrelatedly. But because of cortex area 1, there’s strong physiological arousal that sweeps through the brain and locks in this configuration within the hippocampus as a strong memory that “feels true” going forward.

That’s probably not correct in full detail, but my guess is that it’s something kinda like that.

I’d bet that Noam Brown’s TED AI talk has a lot of overlap with this one that he gave in May. So you don’t have to talk about it second-hand, you can hear it straight from the source.  :) In particular, the “100,000×” poker scale-up claim is right near the beginning, around 6 minutes in.

The goal is to have a system where there are no unlabeled parameters ideally. That would be the world modeling system. It then would build a world model that would have many unlabeled parameters.

Yup, this is what we’re used to today:

  • there’s an information repository,
  • there’s a learning algorithm that updates the information repository,
  • there’s an inference algorithm that queries the information repository,
  • both the learning algorithm and the inference algorithm consist of legible code written by humans, with no inscrutable unlabeled parameters,
  • the high-dimensional space [or astronomically-large set, if it’s discrete] of all possible configurations of the information repository is likewise defined by legible code written by humans, with no inscrutable unlabeled parameters,
  • the only inscrutable unlabeled parameters are in the content of the information repository, after the learning algorithm has been running for a while.

So for example, in LLM pretraining, the learning algorithm is backprop, the inference algorithm is a forward pass, and the information repository is the weights of a transformer-architecture neural net. There’s nothing inscrutable about backprop, nor about a forward pass. We fully understand what those are doing and how. Backprop calculates the gradient, etc.

That’s just one example. There are many other options! The learning algorithm could involve TD learning. The inference algorithm could involve tree search, or MCMC, or whatever. The information repository could involve a learned value function and/or a learned policy and/or a learned Bayes net and/or a learned OpenCog AtomSpace or whatever. But in all cases, those six bullets above are valid.
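The six bullets can be sketched in miniature. This is a deliberately tiny, hypothetical stand-in (a two-weight linear model with made-up data), not a claim about any real system. Every line of `infer` and `learn` is legible, and the only "unlabeled" numbers are whatever ends up in `weights` after training.

```python
# Hypothetical miniature: legible code everywhere, learned numbers only in `weights`.

def infer(weights, x):
    """Inference algorithm: queries the information repository."""
    return sum(w * xi for w, xi in zip(weights, x))

def learn(weights, x, y, lr=0.1):
    """Learning algorithm: one gradient step on half the squared error."""
    err = infer(weights, x) - y
    return [w - lr * err * xi for w, xi in zip(weights, x)]

# Made-up training data generated by y = 2*x0 - 1*x1.
data = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0), ((1.0, 1.0), 1.0)]

# Information repository: its configuration space (pairs of floats) is defined by this code.
weights = [0.0, 0.0]
for _ in range(200):
    for x, y in data:
        weights = learn(weights, x, y)

print(weights)  # the learned content of the repository
```

With two weights the learned content happens to be readable; the point of the last bullet is that at scale it is not, even though the code around it still is.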

So anyway, this is already how ML works, and I’m very confident that it will remain true until TAI, for reasons here. And this is a widespread consensus.

By understanding the world modeler system you can ensure that the world model has certain properties. E.g. there is some property (which I don't know) of how to make the world model not contain dangerous minds.

There’s a very obvious failure mode in which: the world-model models the world, and the planner plans, and the value function calculates values, etc. … and at the end of all that, the AI system as a whole hatches and executes a plan to wipe out humanity. The major unsolved problem is: how do we confidently avoid that?

Then separately, there’s a different, weird, exotic type of failure mode, where, for example, there’s a full-fledged AGI agent, one that can do out-of-the-box foresighted planning etc., but this agent is not working within the designed AGI architecture (where the planner plans etc. as above), but rather the whole agent is hiding entirely within the world-model. I think that, in this kind of system, the risk of this exotic failure mode is very low, and can be straightforwardly mitigated to become even lower still. I wrote about it a long time ago at Thoughts on safety in predictive learning.

I really think we should focus first and foremost on the very obvious failure mode, which again is an unsolved problem that is very likely to manifest, and we should put aside the weird exotic failure mode at least until we’ve solved the big obvious one.

When we put aside the exotic failure mode and focus on the main one, then we’re no longer worried about “the world model contains dangerous minds”, but rather we’re worried about “something(s) in the world model has been flagged as desirable, that shouldn’t have been flagged as desirable”. This is a hard problem not only because of the interpretability issue (I think we agree that the contents of the world-model are inscrutable, and I hope we agree that those inscrutable contents will include both good things and bad things), but also because of concept extrapolation / goal misgeneralization (i.e., the AGI needs to have opinions about plans that bring it somewhere out of distribution). It’s great if you want to think about that problem, but you don’t need to “understand intelligence” for that, you can just assume that the world-model is a Bayes net or whatever, and jump right in! (Maybe start here!)

To me it just seems that limiting the depth of a tree search is better than limiting the compute of a black box neural network. It seems like you can get a much better grip on what it means to limit the depth, and what this implies about the system behavior, when you actually understand how tree search works. Of course tree search here is only an example.

Right, but the ability to limit the depth of a tree search is basically useless for getting you to safe and beneficial AGI, because you don’t know the depth that allows dangerous plans, nor do you know that dangerous plans won’t actually be simpler (less depth) than intended plans. This is a very general problem. This problem applies equally well to limiting the compute of a black box, limiting the number of steps of MCMC, limiting the amount of (whatever OpenCog AtomSpace does), etc.

[You can also potentially use tree search depth to try to enforce guarantees about myopia, but that doesn’t really work for other reasons.]
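A toy sketch of the depth-limit point above, in a hypothetical five-state world where the plan we would call dangerous happens to be shorter than the intended one. The graph, the state names, and the "dangerous" label are all made up for illustration.

```python
# Hypothetical toy world: each edge is one action; both routes reach "goal".
GRAPH = {
    "start": ["shortcut", "step1"],
    "shortcut": ["goal"],   # 2-step plan, the one we would label dangerous
    "step1": ["step2"],
    "step2": ["step3"],
    "step3": ["goal"],      # 4-step plan, the intended one
}

def plans(state, goal, depth_limit, path=()):
    """Depth-limited search: yield every action sequence reaching goal within the limit."""
    path = path + (state,)
    if state == goal:
        yield path
        return
    if depth_limit == 0:
        return
    for nxt in GRAPH.get(state, []):
        yield from plans(nxt, goal, depth_limit - 1, path)

print(list(plans("start", "goal", depth_limit=3)))  # only the shortcut plan fits
print(list(plans("start", "goal", depth_limit=4)))  # both plans fit
```

Here tightening the limit removes the intended plan first. Nothing about the depth knob itself says which side of the limit the dangerous plans will fall on.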

Python code is a discrete structure. You can do proofs on it more easily than for a NN. You could try to apply program transformations on it that preserve functional equality, trying to optimize for some measure of "human understandable structure". There are image classification algorithms iirc that are worse than NN but much more interpretable, and these algorithms would at most be hundreds of lines of code I guess (haven't really looked a lot at them).

“Hundreds of lines” is certainly wrong because you can easily recognize tens of thousands of distinct categories of visual objects. Probably hundreds of thousands.

Proofs sound nice, but what do you think you can realistically prove that will help with Safe and Beneficial AGI? You can’t prove things about what AGI will do in the real world, because the real world will not be encoded in your formal proof system (pace davidad).

“Applying program transformations that optimize for human understandable structure” sounds nice, but only gets you to “inscrutable” from “even more inscrutable”. The visual world is complex. The algorithm can’t be arbitrarily simple, while still capturing that complexity. Cf. “computational irreducibility”.

I'm not brainstorming on "how could this system fail". Instead I understand something, and then I just notice without really trying, that now I can do a thing that seems very useful, like making the system not think about human psychology given certain constraints.

What I’m trying to do in this whole comment is point you towards various “no-go theorems” that Eliezer probably figured out in 2006 and put onto Arbital somewhere.

Here’s an analogy. It’s appealing to say: “I don’t understand string theory, but if I did, then I would notice some new obvious way to build a perpetual motion machine.” But no, you won’t. We can rule out perpetual motion machines from very general principles that don’t rely on how string theory works.

By the same token, it’s appealing to say: “I don’t understand intelligence, but if I did, then I would notice some new obvious way to guarantee that an AGI won’t try to manipulate humans.” But no, you won’t. There are deep difficulties that we know you’re going to run into, based on very general principles that don’t rely on the data format for the world-model etc.

I suggest thinking harder about the shape of the solution, i.e. getting all the way to Safe & Beneficial AGI. I think you’ll come to realize that figuring out the data format for the world-model etc. is not only dangerous (because it’s AGI capabilities research) but doesn’t even help appreciably with safety anyway.
