I was recently talking to a bisexual friend who cannot comprehend what it is like to be gay or straight. Her argument goes like this: Suppose you meet with a smart, pretty, high-status person but his/her genitals are concealed and his/her other sexually-dimorphic features are ambiguous. Would you want to kiss this person?

If you're bisexual then the answer is "yes", but if you're gay or straight then your answer is "I need to know what genitals he/she has". This is interesting from an algorithmic perspective.

Beyond Reinforcement Learning

In the beginning, Freud invented psychology. His models of how the mind worked were mostly wrong (except for the idea that most thinking is unconscious). But his worst sin was that his theories tended to be unfalsifiable. To advance, psychology needed to be grounded in material observations. In response, the Behaviorist school of psychology was invented, which bucketed every behavior into either a simple reflex or a consequence of operant conditioning (reinforcement learning).

Behaviorism was a step up from Freudian psychoanalysis, but Behaviorism is an ideology and is therefore wrong. Many behaviors, like blinking when an object comes near your eye, are simple reflexes. Other behaviors, especially addictions like compulsive drinking and gambling, are consequences of reinforcement learning. But people exhibit complex behaviors in pursuit of goals that have never been reinforced.

Consider a horny virginal teenage boy. He has never had sex but he wants to, so he seduces a woman to have sex with. Seducing a mate cannot be a reflex. That's because it is a complex process the details of which vary wildly across time and between cultures. Seducing a mate can be reinforced, but it can't be the result of reinforcement learning the first time you do it, because the first time you do something is before you've been reinforced.

Exhibiting complex strategic behavior in pursuit of a goal (especially an abstract goal) is beyond the domain of reflexes and reinforcement learning. Something else must be going on.

Strategizers

If you want a mind to exhibit strategic behavior, then you need to give it a causal world model, a value system and a search algorithm. The search algorithm simulates several realities via the causal world model and ranks them via the value system.
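As a toy illustration of those three components working together (everything here, including the scalar "world" and the goal of reaching the number 10, is my own made-up example, not the post's model):

```python
# Toy strategizer: a causal world model, a value system, and a search
# algorithm that simulates several futures and ranks them. The scalar
# "world" and the goal state of 10 are illustrative assumptions.

def world_model(state, action):
    """Causal model: predict the next state given an action."""
    return state + action  # toy dynamics: actions add to a scalar state

def value(state):
    """Value system: prefer states near the goal state 10."""
    return -abs(state - 10)

def search(state, actions, depth=3):
    """Simulate every action sequence with the world model, rank by value."""
    best_plan, best_value = None, float("-inf")

    def rollout(s, plan):
        nonlocal best_plan, best_value
        if len(plan) == depth:
            if value(s) > best_value:
                best_plan, best_value = list(plan), value(s)
            return
        for a in actions:
            rollout(world_model(s, a), plan + [a])

    rollout(state, [])
    return best_plan

print(search(0, [1, 2, 5]))  # a three-step plan whose end state is near 10
```

Nothing here was ever reinforced; the plan falls out of simulating the model and ranking the results.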

The causal world models in the human mind are created via predictive processing. How predictive processing works is beyond the scope of this post, but the basic idea is that if you feed a bunch of sensory data into a net of neurons that are all trying to minimize free energy by anticipating their own inputs, then the neurons will learn to simulate the external universe in real time.
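A drastically simplified sketch of that idea (my own toy example, not a real predictive-processing model): a single unit does gradient descent on its own prediction error, and ends up modeling its input stream.

```python
# One "neuron" holds a prediction of its input and does gradient
# descent on squared prediction error (surprise). Fed a stream of
# sensory data, it converges on a model (here, just the mean) of
# that data. Purely illustrative; real PP networks are hierarchical.

def minimize_surprise(stream, lr=0.1):
    prediction = 0.0
    for observation in stream:
        error = observation - prediction  # prediction error
        prediction += lr * error          # descend d(error^2)/d(prediction)
    return prediction

sensory_data = [4.0] * 100                # a very predictable world
print(minimize_surprise(sensory_data))    # prediction approaches 4.0
```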

An internal model of the external universe isn't useful unless you have a way of producing causal outputs. One cool feature of predictive processors is that they can produce causal outputs. If you connect the neurons in a predictive processor to motor (output) neurons, then the predictive processor will learn to send motor outputs which minimize predictive error, i.e. minimize surprise.
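The flip side can be sketched the same way (again my own toy example): hold the prediction fixed and let the "motor output" change the world until the prediction comes true. Same error signal, opposite direction of fit.

```python
# Acting to minimize surprise: instead of updating its prediction to
# match the world, the agent emits motor outputs that change the world
# to match a fixed prediction. Toy scalar world; illustrative only.

def act_to_minimize_surprise(world_state, prediction, steps=50, gain=0.2):
    for _ in range(steps):
        error = prediction - world_state  # the same prediction error as before
        world_state += gain * error       # motor output reduces the error
    return world_state

print(act_to_minimize_surprise(world_state=0.0, prediction=4.0))  # world driven toward 4.0
```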

We haven't programmed the predictive processor to do anything yet but it already has values (namely, to minimize surprise). The Orthogonality Thesis states that an agent can have any combination of intelligence level and final goal. Predictive processors are non-orthogonal. They cannot have any final goal because any goal must, in some sense, minimize free energy (surprise).

Wait a minute. Don't we get bored and seek out novelty? And isn't novelty a form of surprise (which increases free energy)? Yes, but that's because the human brain isn't a pure predictive processor. The brain gets a squirt of dopamine when it exhibits a behavior that evolution wants to reinforce. Dopamine-moderated reinforcement learning alone is enough to elicit non-free-energy-minimizing behaviors (such as gambling) from a predictive processor.

Abstract Concepts

So far, our predictive processor can only pursue simple goals defined in terms of raw sensory stimuli. How do we program it to robustly identify abstract concepts like male and female? Via checksums. If you point several hard-coded heuristics at a small region of the neural network, then that region of the neural network will generalize them into a single abstract concept that will feel like a fundamental building block of reality.
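A minimal sketch of the redundancy idea (the names and numbers are my own illustration): each hard-coded heuristic is a noisy binary detector, and the region treats their agreement as a single concept, so no single failed heuristic flips the output.

```python
# Checksum-style concept unit: several redundant heuristics point at
# one region; the region fires only when most of them agree.

def concept_region(heuristic_activations, threshold=0.5):
    """Binary concept unit: fires when most heuristics agree."""
    mean = sum(heuristic_activations) / len(heuristic_activations)
    return 1 if mean > threshold else 0

# Five heuristics; one is fooled, but the redundancy saves us.
print(concept_region([1, 1, 1, 1, 0]))  # 1: concept still detected
print(concept_region([0, 1, 0, 0, 0]))  # 0: one stray cue isn't enough
```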

[Click here for more information about how checksums work in a predictive processor.]

Complex abstract concepts can be bootstrapped from simple abstract concepts by using the activations of simple concepts as heuristics for more complex concepts. Such a model of the world is robust against noise, error and even deception due to its hierarchical use of checksums. The binary activations of abstract-concept-coding regions even make our predictive processor's world model robust against gradient descent attacks. (Gradient descent attacks by outside forces aren't possible here the way they are in machine learning, but they are a threat you must defend against internally to prevent wireheading.)
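One way to see the gradient-robustness claim (a toy illustration of mine, not a proof): a binary unit is flat almost everywhere, so small perturbations give an attacker, or a would-be wireheading process, no slope to climb.

```python
# A thresholded concept unit has zero gradient almost everywhere, so
# gradient-following attacks get no signal from it. Illustrative only.

def binary_concept(x, threshold=0.5):
    return 1.0 if x > threshold else 0.0

def finite_difference_gradient(f, x, eps=1e-4):
    return (f(x + eps) - f(x - eps)) / (2 * eps)

print(finite_difference_gradient(binary_concept, 0.9))  # 0.0: nothing to ascend
```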

We now have all the components we need to create a system that robustly pursues abstract goals.

Our predictive processor creates a model of the universe via minimizing free energy. We can force the predictive processor to think in terms of the abstract concepts of our choosing by pointing a bunch of heuristics at a small region.

Consider the concept of sex. The most obvious way to program it is to have two adjacent clusters of neurons. One activates in response to femaleness and deactivates in response to maleness. The other activates in response to maleness and deactivates in response to femaleness. Activating one cluster suppresses the other cluster and vice versa. Point all our male heuristics (like broad shoulders and a deep voice) toward the male cluster and all our female heuristics (like breasts and variable tonal inflection) toward the female cluster. We now have a useful ontology for navigating human culture.
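The two-cluster scheme can be sketched as a winner-take-all circuit (my own toy version; the cue values are arbitrary): each cluster sums its own heuristics and is suppressed by the other cluster's activity, so mixed evidence resolves to one pole.

```python
# Mutual inhibition between two concept clusters: activating one
# suppresses the other until a single winner remains. Illustrative.

def sex_ontology(male_cues, female_cues, steps=20, inhibition=0.5):
    male, female = sum(male_cues), sum(female_cues)
    for _ in range(steps):
        male, female = (max(0.0, male - inhibition * female),
                        max(0.0, female - inhibition * male))
    return {"male": male, "female": female}

# Broad shoulders + deep voice, plus one weak ambiguous female cue:
print(sex_ontology(male_cues=[1.0, 1.0], female_cues=[0.4]))
```

The weaker cluster is driven to zero rather than left partially active, which is the point the next paragraph turns on.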

We have also programmed our predictive processor to be prejudiced against transsexuals and extremely prejudiced against nonbinary people. Our predictive processor wants sexual deviants—anyone who behaves unexpectedly for their sex—to literally not exist.

And that, I think, is why sexual deviants (such as gays) used to be sent to death camps.

Credits

Thanks Justis for helping edit this.


Suppose you meet with a smart, pretty, high-status person but his/her genitals are concealed and his/her other sexually-dimorphic features are ambiguous. Would you want to kiss this sexy person?

Begging the question. It assumes that the person is "pretty" and "sexy", but isn't that the part we were actually curious about? A better question would be: "Do you find people with sexually dimorphic features sexy?" Some people would say "yes", some people would say "no".

Good point. I've recently been talking with someone whose native language isn't English and we've been using "pretty" as the imprecise translation of a non-gender-specific adjective. I have removed the word "sexy" entirely.

We now have all the components we need to create a system that robustly pursues abstract goals.

(I note that if this part works out reliably, alignment would essentially be solved.)

I would be flattered, had your comment been a compliment. ☺

What I meant is that we have a system with a self-correcting world model which solves the "finger pointing at the Moon" problem. It optimizes the world according to its beliefs about the Moon, even though all we could give it was the finger.

To be clear, I don't necessarily think you're wrong about how bio brains do it. A lot rests on the word "reliably". One possible explanation for sexual fetishes is that the human biological mechanism for pointing at sexual partners is quite unreliable (a hypothesis I predict you agree with).

But if we could get a similar mechanism to work reliably, we'd have a mechanism for pointing learning machines at things in the world.

We haven't programmed the predictive processor to do anything yet but it already has values (namely, to minimize surprise). The Orthogonality Thesis states that an agent can have any combination of intelligence level and final goal. Predictive processors are non-orthogonal. They cannot have any final goal because any goal must, in some sense, minimize free energy (surprise).

I could be wrong, because my understanding of PP is limited, but this feels like a level-confused argument to me. Adjusting neural weights to minimize surprise does not imply the same thing as active planning to minimize future surprise!! For example, generative pretraining for large language models (LLMs) minimizes next-bit prediction error, but does not train behaviors which actively plan ahead to manipulate the environment to minimize future predictive error. (Any such behaviors would be a side-effect due to unexpected inner optimization misgeneralizing the goal; not at all directly selected for.)

On the other hand, fine-tuning of LLMs can use RL, which is to say, can optimize for longer-term sequential strategies which accomplish specific objectives. 

So gradient descent can do either thing -- pure prediction (supervised learning from labeled data, eg sequences labeled with continuations) or reinforcement learning (incentivizing long-term planning by assigning credit to many recent actions rather than just the one most recent output).

An earlier post of yours claims PP is equivalent to gradient descent, so I assume PP can also do both of those things, although I don't currently understand how PP does RL. But if so, there should be no obstacle to the orthogonality thesis within PP.

Having now worked through some of the technical details of the equivalence between backprop and predictive coding, I still think my objection is right.

Your comment is a good one. I want to give it the treatment it deserves. There are a few ways I can better explain what I'm attempting to get at in my original post, and I'm not sure what the best approach is.

Instead of addressing you point-by-point, I'm going to try backing up and looking at the bigger picture.

The Orthogonality Thesis

I think that the most important thing you take issue with is my claim that PP violates the orthogonality thesis. Well, I also claim that PP is (in some abstract mathematical sense) equivalent to backpropagation. If PP violates the orthogonality thesis then I should be able to provide an example of how backpropagation violates the orthogonality thesis too.

Consider a backpropagation-based feed-forward neural network. Our FFNN is the standard multilayer perceptron (perhaps with improvements such as an attention mechanism). The FFNN has some input sensors which read data from the real world. These can be cameras, microphones and an Internet connection. The FFNN takes actions on the outside world via its output nodes. Its output nodes are hooked up to robots.

We train our FFNN via backpropagation. We feed in examples of sensory information. The FFNN generates actions. Then we calculate the error between what we want the FFNN to output and what the FFNN actually did output. Then we use the backpropagation algorithm to adjust the internal weights of the FFNN.
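That training loop, shrunk to a single weight (my own toy illustration of the procedure described above):

```python
# Minimal supervised training loop: feed in inputs, compare the output
# to the desired action, adjust the weight against the error gradient.
# A one-weight "network" stands in for the FFNN.

def train(examples, lr=0.05, epochs=200):
    w = 0.0
    for _ in range(epochs):
        for x, desired in examples:
            actual = w * x            # forward pass
            error = actual - desired  # error vs. desired output
            w -= lr * error * x       # backpropagation step
    return w

# Teach the "network" to double its input:
examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
print(train(examples))  # w approaches 2.0
```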

Is there anything we can't teach the FFNN via this method?

We can teach it to play chess, build cars, take over the world and disassemble stars. But there is one thing we can't teach it to do: We can't teach it to maximize its own error function. It's not just physically impossible. It's a logical contradiction.
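To make the asymmetry concrete (toy numbers of mine, not a proof): the update rule is defined as descent on the error, so under its own rule the error can only shrink; "maximize your own error function" asks the same rule to climb the very signal it is built to descend.

```python
# Under gradient descent, the training error monotonically decreases
# on this convex toy problem; the rule cannot be turned against itself.

def sgd_step(w, x, desired, lr=0.1):
    error = (w * x - desired) ** 2
    grad = 2 * (w * x - desired) * x
    return w - lr * grad, error

w, errors = 0.0, []
for _ in range(20):
    w, e = sgd_step(w, 1.0, 1.0)
    errors.append(e)
print(errors[0] > errors[-1])  # True: error only goes down under its own rule
```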

Given sufficient data, a FFNN can approximate any reasonably behaved function from its inputs to its outputs. But it can't necessarily realize a function that takes the network itself as part of its own input, because self-reference imposes a cyclic constraint.

If you build a FFNN in the real world and want it to optimize the real world…well, the FFNN is part of the real world. That causes a self-reference, which constrains the freedom of the orthogonality thesis.

I don't expect this explanation to fully answer all of your objections, but I hope it gets us closer to understanding each other.

Clarifying my original post

You write "I don't currently understand how PP does RL". I'm not claiming that PP does RL. PP can do RL, but that's not important. The biological neural network model in my original post is getting trained simultaneously by two different algorithms with two different optimization targets. The PP algorithm is running at all times and is training the neural network to minimize surprise. The RL algorithm is activated intermittently and trains the neural network to take actions that produce squirts of dopamine.

Adjusting neural weights to minimize surprise does not imply the same thing as active planning to minimize future surprise!!

You're right. The model I've described only does local gradient descent (of surprise, not error). It doesn't do strategic planning (unless it developed emergent complex machinery to do so).

You write "I don't currently understand how PP does RL". I'm not claiming that PP does RL. PP can do RL, but that's not important. The biological neural network model in my original post is getting trained simultaneously by two different algorithms with two different optimization targets. The PP algorithm is running at all times and is training the neural network to minimize surprise. The RL algorithm is activated intermittently and trains the neural network to take actions that produce squirts of dopamine.

To be clear here, I did understand that you posit a dual-system approach in this post, with squirts of dopamine for RL and PP for everything else. However, I didn't really understand why you wanted to posit that, in the context of your other posts, where you mention PP doing the RL part too.

We can teach it play chess, build cars, take over the world and disassemble stars. But there is one thing we can't teach it to do: We can't teach it to maximize its own error function. It's not just physically impossible. It's a logical contradiction.

But is this important/interesting?

Here's my problem. I feel like in general when talking about PP, I end up chasing shadows. First, there's a lot of naive PP discourse out there, where people just talk about "minimizing free energy" like it explains everything, with no apparent understanding of the nuances behind which different types of free energy you can minimize, in what kind of minimization framework, etc. These people claim they can explain any psychological phenomenon in terms of minimizing predictive error. So you get paragraphs like:

If you connect the neurons in a predictive processor to motor (output) neurons, then the predictive processor will learn to send motor outputs which minimize predictive error, i.e. minimize surprise.

And then you get the semi-experts/semi-dilettantes, who have read a few papers on the subject and can't claim to explain everything, but recognize the obvious fallacies and believe that there are ways around them.

So then you get paragraphs like:

Wait a minute. Don't we get bored and seek out novelty? And isn't novelty a form of surprise (which increases free energy)? Yes, but that's because the human brain isn't a pure predictive processor. The brain gets a squirt of dopamine when it exhibits a behavior that evolution wants to reinforce. Dopamine-moderated reinforcement learning alone is enough to elicit non-free-energy-minimizing behaviors (such as gambling) from a predictive processor.

Do you see my problem yet? First you start with a theory (free energy minimization) which can already explain anything and everything, but if you take it really seriously, it does heuristically suggest some predictions over others. And then some of those predictions are wrong; EG it predicts that organisms disproportionately like to hang out in dark, quiet rooms where there's no surprise. So maybe you retreat to the general can-predict-anything version. Or maybe you start patching it, by tacking on some amount of RL. Or maybe you do something else. 

This seems to me like a recipe for scientific disaster.

I get this feeling that people must be initially attracted to PP by (a) the promised generality (which actually means it doesn't predict anything very strongly), or (b) the neat math, or (c) some particular clever arguments about how some specific phenomena can be understood as minimization of prediction error, like maybe how humans often seem to confuse 'is' with 'ought'. And then, if they get far enough, they start to see how the naive version can't make sense; but there are so many ways to patch it, and other smart people who seem to believe that things work out...

I haven't examined the pile of evidence that's supposedly in favor of actual PP in the actual brain. I'm missing a ton of context. I just get the feeling from a distance, that it's this intellectual black hole. 

We can't teach it to maximize its own error function. It's not just physically impossible. It's a logical contradiction.

But is this important/interesting?

Because it implies the existence of a fixed point of epistemic convergence that's robust against wireheading. It solves one of the fundamental questions of AI Alignment, at least in theory.

Do you see my problem yet? First you start with a theory (free energy minimization) which can already explain anything and everything, but if you take it really seriously, it does heuristically suggest some predictions over others. And then some of those predictions are wrong; EG it predicts that organisms disproportionately like to hang out in dark, quiet rooms where there's no surprise. So maybe you retreat to the general can-predict-anything version. Or maybe you start patching it, by tacking on some amount of RL. Or maybe you do something else.

I totally hang out in dark, quiet rooms where there's no surprise.

But more seriously, this is basically how evolution works too. It starts with a simple system and then it patches it. Evolved systems are messy and convoluted.

This seems to me like a recipe for scientific disaster.…I haven't examined the pile of evidence that's supposedly in favor of actual PP in the actual brain. I'm missing a ton of context. I just get the feeling from a distance, that it's this intellectual black hole.

You're right. The problem is even broader than you write. Psychology is a recipe for scientific disaster. Freud was a disaster. The Behaviorists were (less of) a disaster. And those are (to my knowledge) the two most powerful schools in psychiatry.

But I think I'm mostly right about the basics, and the right thing to do under such circumstances is to post my predictions on a public forum. If you think I'm wrong, then you can register your counter-prediction and we can check back in 30 years and we'll see if one of us has been proven right.

But more seriously, this is basically how evolution works too. It starts with a simple system and then it patches it. Evolved systems are messy and convoluted.

I don't deny this. My fear isn't a general fear that any time we conclude there's a base system with some patches, we're wrong. Rather, I have a fear of using these patches to excuse a bad theory, like epicycle theory vs Newton. The specific worry is more like why do people start buying this in the first place? I've never seen concrete evidence that it helps people understand things?? And when people check the math in Friston papers, it seems to be a Swiss Cheese of errors???

If you think I'm wrong, then you can register your counter-prediction and we can check back in 30 years and we'll see if one of us has been proven right.

To state the obvious, this feedback loop is too slow, but obviously that's compatible with your point here.

Still, I hope we can find predictions that can be tested faster.

Or even moreso, I hope that we can spell out reasons for believing things which help us find double-cruxes which we can settle through simple discussion. 

Treating "PP" as a monolithic ideology probably greatly exaggerates the seeming disagreement. I don't have any dispute with a lot of the concrete PP methodology. For example, the predictive coding = gradient descent paper commits no sins by my lights. I haven't understood the math in enough detail to believe the biological implications yet (I feel, uneasily, like there might be a catch somewhere which makes it still not too biologically plausible). But at base, it's a result showing that a specific variational method is in-some-sense equivalent to gradient descent. 

(As long as we're in the realm of "some specific variational method" instead of blurring everything together into "free energy minimization", I'm relatively happier.)

If you want to get into that level of technical granularity then there are major things that need to change before applying the PP methodology in the paper to real biological neurons. Two of the big ones are brainwave oscillations and existing in the flow of time.

Mostly what I find interesting is the theory that the bulk of animal brain processing goes into creating a real-time internal simulation of the world, that this is mathematically plausible via forward-propagating signals, and that error and entropy are fused together.

When I say "free energy minimization" I mean the idea that error and surprise are fused together (possibly with an entropy minimizer thrown in).

Because it implies the existence of a fixed point of epistemic convergence that's robust against wireheading. It solves one of the fundamental questions of AI Alignment, at least in theory.

Your claim is a variant of, like, "you can't seek to minimize your own utility function". Like, sure, yeah...

I expected that the historical record would show that carefully spelled-out versions of the orthogonality thesis would claim something like "preferences can vary almost independently of intelligence" (for reasons such as, an agent can prefer to behave unintelligently; if it successfully does so, it scarcely seems fair to call it highly intelligent, at least in so far as definitions of intelligence were supposed to be behavioral).

I was wrong; it appears that historical definitions of the orthogonality thesis do make the strong claim that goals can vary independently of intellect.

So yeah, I think there are some exceptions to the strongest form of the orthogonality thesis (at least, depending on definitions of intelligence). 

OTOH, the claims that no agent can seek to maximize its own learning-theoretic loss, or minimize its own utility-theoretic preferences, don't really speak against Orthogonality. Since they're intelligence-independent constraints.

But you were talking about wireheading.

How does "agents cannot seek to maximize their own learning-theoretic loss" take a bite out of wireheading? It seems entirely compatible with wireheading.

I appreciate your epistemic honesty regarding the historical record.

As for the theory of wireheading, I think it's drifting away from the original topic of my post here. I created a new post Self-Reference Breaks the Orthogonality Thesis which I think provides a cleaner version of what I'm trying to say, without the biological spandrels. If you want to continue this discussion, I think it'd be better to do so there.