I was recently talking to a bisexual friend who cannot comprehend what it is like to be gay or straight. Her argument goes like this: Suppose you meet a smart, pretty, high-status person, but his/her genitals are concealed and his/her other sexually dimorphic features are ambiguous. Would you want to kiss this person?
If you're bisexual then the answer is "yes", but if you're gay or straight then your answer is "I need to know what genitals he/she has". This is interesting from an algorithmic perspective.
Beyond Reinforcement Learning
In the beginning, Freud invented psychology. His models of how the mind worked were mostly wrong (except for the idea that most thinking is unconscious). But his worst sin was that his theories tended to be unfalsifiable. To advance, psychology needed to be grounded in material observations. In response, the Behaviorist school of psychology was invented, which bucketed every behavior into either a simple reflex or a consequence of operant conditioning (reinforcement learning).
Behaviorism was a step up from Freudian psychoanalysis, but Behaviorism is an ideology and is therefore wrong. Many behaviors, like blinking when an object comes near your eye, are simple reflexes. Other behaviors, especially addictions like compulsive drinking and gambling, are consequences of reinforcement learning. But people exhibit complex behaviors in pursuit of goals that have never been reinforced.
Consider a horny virginal teenage boy. He has never had sex but he wants to, so he seduces a woman to have sex with. Seducing a mate cannot be a reflex. That's because it is a complex process the details of which vary wildly across time and between cultures. Seducing a mate can be reinforced, but it can't be the result of reinforcement learning the first time you do it, because the first time you do something is before you've been reinforced.
Exhibiting complex strategic behavior in pursuit of a goal (especially an abstract goal) is beyond the domain of reflexes and reinforcement learning. Something else must be going on.
Strategizers
If you want a mind to exhibit strategic behavior, then you need to give it a causal world model, a value system and a search algorithm. The search algorithm simulates several realities via the causal world model and ranks them via the value system.
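Here's a toy sketch (in Python) of that three-part loop. The world model, value system and search procedure below are stand-ins I made up for illustration; the point is only the shape of the algorithm: imagine several futures, then rank them.

```python
import itertools

# Toy stand-ins for a causal world model, a value system and a brute-force search.
# The states, actions and scoring are illustrative assumptions, not a real agent.

def world_model(state, action):
    """Predict the next state caused by taking `action` in `state`."""
    return state + action          # toy causal dynamics

def value_system(state):
    """Rank how desirable a simulated state is."""
    return -abs(state - 10)        # toy goal: end up near 10

def search(state, actions=(-1, 0, 1), depth=3):
    """Simulate several possible futures with the world model,
    then rank them with the value system."""
    best_plan, best_value = None, float("-inf")
    for plan in itertools.product(actions, repeat=depth):
        simulated = state
        for a in plan:
            simulated = world_model(simulated, a)   # imagine, don't act
        v = value_system(simulated)
        if v > best_value:
            best_plan, best_value = plan, v
    return best_plan

print(search(state=0))   # (1, 1, 1): head toward the imagined goal
```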
The causal world models in the human mind are created via predictive processing. How predictive processing works is beyond the scope of this post, but the basic idea is that if you feed a bunch of sensory data into a net of neurons that are all trying to minimize free energy by anticipating their own inputs, then the neurons will learn to simulate the external universe in real time.
An internal model of the external universe isn't useful unless you have a way of producing causal outputs. One cool feature of predictive processors is that they can produce causal outputs. If you connect the neurons in a predictive processor to motor (output) neurons, then the predictive processor will learn to send motor outputs which minimize predictive error, i.e. minimize surprise.
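Here's a deliberately tiny numerical sketch of that idea: a single predictive unit reduces its surprise in two ways, by updating its prediction (perception) and by emitting an output that changes its own input (action). The dynamics and constants are made up for illustration, not a claim about real cortical wiring.

```python
# Minimal sketch: one predictive unit minimizing surprise (prediction error)
# by two routes -- updating its belief, and acting on the world.
# All constants here are illustrative assumptions.

world = 5.0          # hidden cause of the sensory signal
belief = 0.0         # the unit's prediction of its input
lr_perception = 0.3  # how fast the belief tracks the input
lr_action = 0.2      # how fast action drags the world toward the prediction

for step in range(20):
    sensation = world
    error = sensation - belief                 # "surprise"
    belief += lr_perception * error            # perception: change the prediction
    world -= lr_action * error                 # action: change what is predicted
    print(f"step {step:2d}  error {error:+.3f}")

# Both updates shrink the same error term, so the loop settles
# where prediction and world agree -- surprise is minimized.
```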
We haven't programmed the predictive processor to do anything yet but it already has values (namely, to minimize surprise). The Orthogonality Thesis states that an agent can have any combination of intelligence level and final goal. Predictive processors are non-orthogonal. They cannot have any final goal because any goal must, in some sense, minimize free energy (surprise).
Wait a minute. Don't people get bored and seek out novelty? And isn't novelty a form of surprise (which increases free energy)? Yes, but that's because the human brain isn't a pure predictive processor. The brain gets a squirt of dopamine when it exhibits a behavior that evolution wants to reinforce. Dopamine-mediated reinforcement learning alone is enough to elicit non-free-energy-minimizing behaviors (such as gambling) from a predictive processor.
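To make that concrete, here's a toy sketch of one behavioral preference being pulled on by both signals at once: the always-on surprise penalty and an intermittent dopamine reward for the surprising behavior. The numbers are made up; the point is only that the reward term can win.

```python
import random

# Sketch: one behavioral preference shaped by two signals at once.
# The predictive-processing signal penalizes surprise (novelty),
# while an intermittent dopamine signal reinforces the novel behavior anyway.
# All constants are illustrative assumptions.

preference_for_novelty = 0.0
lr_surprise, lr_dopamine = 0.1, 0.5

for trial in range(30):
    explores = preference_for_novelty > 0 or random.random() < 0.2
    if explores:
        preference_for_novelty -= lr_surprise    # PP: novelty is surprise, push down
        preference_for_novelty += lr_dopamine    # RL: evolution rewards exploring

print(round(preference_for_novelty, 2))   # ends up positive: dopamine beats surprise
```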
Abstract Concepts
So far, our predictive processor can only pursue simple goals defined in terms of raw sensory stimuli. How do we program it to robustly identify abstract concepts like male and female? Via checksums. If you point several hard-coded heuristics at a small region of the neural network, then that region of the neural network will generalize them into a single abstract concept that will feel like a fundamental building block of reality.
[Click here for more information about how checksums work in a predictive processor.]
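I can't reproduce the full checksum machinery here (see the link above), but the basic move, several hard-coded heuristics converging on one small region that learns to summarize their agreement, looks roughly like this. The heuristics, weights and learning rule are placeholders I made up for illustration.

```python
import random

# Sketch: point several hard-coded heuristics at one small region ("concept unit")
# and let that region learn a weight for each heuristic by predicting their consensus.
# The heuristics, thresholds and update rule are illustrative assumptions.

HEURISTICS = ["deep_voice", "broad_shoulders", "facial_hair"]
weights = {h: 0.1 for h in HEURISTICS}

def concept_active(cues):
    """The region fires when the weighted heuristic evidence crosses a threshold."""
    return sum(weights[h] * cues[h] for h in HEURISTICS) > 0.5

def train(examples, lr=0.05):
    for cues in examples:
        target = 1.0 if sum(cues.values()) >= 2 else 0.0   # consensus of heuristics
        prediction = 1.0 if concept_active(cues) else 0.0
        for h in HEURISTICS:
            weights[h] += lr * (target - prediction) * cues[h]

# Noisy examples: each heuristic occasionally misfires, but the consensus is stable.
examples = [{h: float(random.random() < 0.8) for h in HEURISTICS} for _ in range(500)]
train(examples)
print(weights)   # each heuristic ends up feeding one shared abstract concept
```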
Complex abstract concepts can be bootstrapped from simple abstract concepts by using the activations of simple concepts as heuristics for more complex concepts. Such a model of the world is robust against noise, error and even deception, thanks to the hierarchical use of checksums. The binary activations of abstract-concept-coding regions even make our predictive processor's world model robust against gradient descent attacks. (Gradient descent attacks by outside forces aren't possible the way they are in machine learning, but they are a threat you must defend against internally to prevent wireheading.)
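The gradient-resistance point can be seen in a two-line comparison: a graded concept unit passes a gradient back to whatever feeds it, while a binary (thresholded) unit does not, so small internal nudges can't smoothly drag the concept toward a wireheaded value. This is a toy illustration, not a claim about the real circuitry.

```python
import math

def graded_unit(x):
    return 1.0 / (1.0 + math.exp(-x))      # smooth activation

def binary_unit(x):
    return 1.0 if x > 0 else 0.0           # thresholded (binary) activation

def numerical_gradient(f, x, eps=1e-6):
    return (f(x + eps) - f(x - eps)) / (2 * eps)

x = 2.0
print(numerical_gradient(graded_unit, x))   # nonzero: a tiny nudge shifts the output
print(numerical_gradient(binary_unit, x))   # zero: the concept can't be nudged smoothly
```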
We now have all the components we need to create a system that robustly pursues abstract goals.
Our predictive processor creates a model of the universe via minimizing free energy. We can force the predictive processor to think in terms of the abstract concepts of our choosing by pointing a bunch of heuristics at a small region.
Consider the concept of sex. The most obvious way to program it is to have two adjacent clusters of neurons. One activates in response to femaleness and deactivates in response to maleness. The other activates in response to maleness and deactivates in response to femaleness. Activating one cluster suppresses the other cluster and vice versa. Point all our male heuristics (like broad shoulders and a deep voice) toward the male cluster and all our female heuristics (like breasts and variable tonal inflection) toward the female cluster. We now have a useful ontology for navigating human culture.
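Here's a sketch of that two-cluster arrangement: each cluster is driven by its own heuristics and suppresses the other, so the network settles into one of two stable states rather than resting comfortably in between. The cue values and inhibition strength are assumptions for illustration.

```python
# Sketch: two mutually inhibiting clusters, one per sex.
# Each cluster is driven by its own heuristic evidence and suppressed by the other.
# All constants are illustrative assumptions.

def settle(male_evidence, female_evidence, inhibition=1.2, steps=50):
    male, female = 0.0, 0.0
    for _ in range(steps):
        male = max(0.0, male_evidence - inhibition * female)
        female = max(0.0, female_evidence - inhibition * male)
    return male, female

# Clear evidence: one cluster wins outright and silences the other.
print(settle(male_evidence=0.9, female_evidence=0.2))
# Ambiguous evidence: the network still forces a winner and silences the other cluster.
print(settle(male_evidence=0.55, female_evidence=0.5))
```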
We have also programmed our predictive processor to be prejudiced against transsexuals and extremely prejudiced against nonbinary people. Our predictive processor wants sexual deviants—anyone who behaves unexpectedly for their sex—to literally not exist.
And that, I think, is why sexual deviants (such as gays) used to be sent to death camps.
Credits
Thanks Justis for helping edit this.
Your comment is a good one. I want to give it the treatment it deserves. There are a few ways I can better explain what I'm attempting to get at in my original post, and I'm not sure what the best approach is.
Instead of addressing you point-by-point, I'm going to try backing up and looking at the bigger picture.
The Orthogonality Thesis
I think that the most important thing you take issue with is my claim that PP violates the orthogonality thesis. Well, I also claim that PP is (in some abstract mathematical sense) equivalent to backpropagation. If PP violates the orthogonality thesis then I should be able to provide an example of how backpropagation violates the orthogonality thesis too.
Consider a backpropagation-based feed-forward neural network. Our FFNN is a standard multilayer perceptron (perhaps with improvements such as an attention mechanism). The FFNN has some input sensors which read data from the real world. These can be cameras, microphones and an Internet connection. The FFNN takes actions on the outside world via its output nodes. Its output nodes are hooked up to robots.
We train our FFNN via backpropagation. We feed in examples of sensory information. The FFNN generates actions. Then we calculate the error between what we want the FFNN to output and what the FFNN actually did output. Then we use the backpropagation algorithm to adjust the internal weights of the FFNN.
Is there anything we can't teach the FFNN via this method?
We can teach it to play chess, build cars, take over the world and disassemble stars. But there is one thing we can't teach it to do: we can't teach it to maximize its own error function. It's not just physically impossible. It's a logical contradiction.
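To make the setup concrete, here's a minimal version of the training loop described above. The weight update is defined as a step down the error gradient, which is why "maximize your own error function" isn't something this procedure can even express. The network size, data and constants are assumptions for illustration.

```python
import numpy as np

# Sketch of the loop described above: sensory input in, action out,
# error between desired and actual output, weights adjusted by backpropagation.
# Network size, data and learning rate are illustrative assumptions.

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.5, size=(3, 8))   # input -> hidden
W2 = rng.normal(scale=0.5, size=(8, 1))   # hidden -> output

def forward(x):
    h = np.tanh(x @ W1)
    return h, h @ W2

# Toy "world": the desired action is the sum of the sensory inputs.
X = rng.normal(size=(64, 3))
Y = X.sum(axis=1, keepdims=True)

lr = 0.05
for step in range(500):
    h, out = forward(X)
    err = out - Y                          # error between actual and desired output
    # Backpropagation: push the error backwards to get weight gradients.
    grad_W2 = h.T @ err / len(X)
    grad_h = err @ W2.T * (1 - h ** 2)     # tanh derivative
    grad_W1 = X.T @ grad_h / len(X)
    W2 -= lr * grad_W2                     # descend the error -- by construction,
    W1 -= lr * grad_W1                     # the network can never learn to climb it

print(float(np.mean((forward(X)[1] - Y) ** 2)))   # error shrinks over training
```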
Given sufficient data, an FFNN can learn any reasonably well-behaved function F: A → B. But it can't necessarily learn a function F: (F, A) → B, because self-reference imposes a cyclic constraint.
If you build an FFNN in the real world and want it to optimize the real world…well, the FFNN is part of the real world. That creates a self-reference, which constrains the freedom the orthogonality thesis promises.
I don't expect this explanation to fully answer all of your objections, but I hope it gets us closer to understanding each other.
Clarifying my original post
You write "I don't currently understand how PP does RL". I'm not claiming that PP does RL. PP can do RL, but that's not important. The biological neural network model in my original post is getting trained simultaneously by two different algorithms with two different optimization targets. The PP algorithm is running at all times and is training the neural network to minimize surprise. The RL algorithm is activated intermittently and trains the neural network to take actions that produce squirts of dopamine.
You're right. The model I've described only does local gradient descent (of surprise, not error). It doesn't do strategic planning (unless it develops complex emergent machinery to do so).