But the more interesting question is: what was happening during the thirty seconds that it took me to walk upstairs? I evidently had motivation to continue walking, or I would have stopped and turned around. But my brainstem hadn’t gotten any ground truth yet that there were good things happening. That’s where “defer-to-predictor mode” comes in! The brainstem, lacking strong evidence about what’s happening, sees a positive valence guess coming out of the striatum and says, in effect, “OK, sure, whatever, I’ll take your word for it.”
It seems like there's some implication here that motivation and positive valence are the same thing?
Is the claim that evolutionarily early versions of behavioral circuits had approximately the form...
If positive reward:
continue current behavior
else:
try something else
...but that adding in long-term predictors instead allows for the following algorithm?
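For concreteness, here's the reactive circuit above in runnable form, plus my reading of what the quoted "defer-to-predictor mode" adds on top of it (the function names and the zero threshold are mine, purely for illustration):

```python
import random

def reactive_step(got_reward, current_behavior, alternatives):
    """Hypothesized evolutionarily-early loop: keep doing whatever just
    paid off, otherwise try something else at random."""
    return current_behavior if got_reward else random.choice(alternatives)

def predictor_gated_step(ground_truth, predicted_valence, current_behavior, alternatives):
    """Same loop, but when there's no ground truth yet (ground_truth is None),
    the circuit takes the learned valence guess's word for it."""
    good = ground_truth if ground_truth is not None else (predicted_valence > 0)
    return current_behavior if good else random.choice(alternatives)
```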
One of the ways you can get up in the morning, if you are me, is by looking in the internal direction of your motor plans, and writing into your pending motor plan the image of you getting out of bed in a few moments, and then letting that image get sent to motor output and happen. (To be clear, I actually do this very rarely; it is just a fun fact that this is a way I can defeat bed inertia.)
I do this, or something very much like this.
For me, it's like the motion of setting a TAP, but to fire imminently instead of at some future trigger, by doing cycles of multi-sensory visualization of the behavior in question.
Perhaps I'm just being dense, but I'm confused why this toy model of a long-term predictor is long-term instead of short-term. I'm trying to think through it aloud in this comment.
A “long-term predictor” is ultimately nothing more than a short-term predictor whose output signal helps determine its own supervisory signal. Here’s a toy model of what that can look like:
At first, I thought the idea was that the latency of the supervisory/error signal was longer than average, and that this latency made the short-term predictor function as a long-term predictor without being functionally any different. But then why is it labeled "short-term predictor"?
It seems like the short-term predictor should learn to predict (based on context cues) the behavior triggered by the hardwired circuitry. But it should predict that behavior only 0.3 seconds early?
...
Oh! Is the key point that there's a kind of resonance, where this system maintains the behavior of the genetically hardwired components? When the switch flips back to defer-to-predictor mode, the short-term predictor is still predicting the hardwired override behavior, which is now trivially "correct", because whatever the predictor outputs is correct. (It was also correct a moment before, when the switch was in override mode, but not trivially correct.)
This still doesn't answer my confusion. It seems like the whole circuit is going to maintain the state from the last "ground truth infusion" and learn to predict the timings and magnitudes of the "ground truth infusions". But it still shouldn't predict them more than 0.3 seconds in advance?
Is the idea that the lookahead propagates earlier and earlier with each cycle? You start with a 0.3-second prediction. But that means the supervisory signal (when in "defer-to-predictor mode") is 0.3 seconds earlier, which means that the predictor learns to predict the change in output 0.6 seconds ahead of when the override "would have happened", and then 0.9 seconds ahead, and then 1.2 seconds ahead, and so on, until it backs all the way up to when the "prior" ground truth infusion sent a different signal?
Like, the thing that this circuit is doing is simulating time travel, so that it can activate (on average) the next behavior that the genetically hardwired circuitry will output, as soon as "override mode" is turned off?
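If that's the right picture, it's basically TD(0)-style bootstrapping. Here's a tiny numerical sketch (entirely my own toy construction; the ten steps, one-step lookahead, and learning rate are arbitrary) where the "supervisory signal" in defer-to-predictor mode is just the predictor's own output one step later:

```python
# Toy sketch (my construction, not from the post): a "short-term" predictor
# whose supervisory signal is its own output one step later, except at the
# final step, where real ground truth arrives.
import numpy as np

N_STEPS = 10        # time steps per episode; ground truth only at the end
LOOKAHEAD = 1       # the "0.3 seconds", expressed as one time step
ALPHA = 0.5         # learning rate (arbitrary)

prediction = np.zeros(N_STEPS)   # predictor's valence guess at each step

for episode in range(50):
    for t in range(N_STEPS):
        if t + LOOKAHEAD >= N_STEPS:
            target = 1.0                         # override mode: real ground truth
        else:
            target = prediction[t + LOOKAHEAD]   # defer-to-predictor mode
        prediction[t] += ALPHA * (target - prediction[t])
    if episode % 10 == 0:
        print(f"episode {episode:2d}: {np.round(prediction, 2)}")

# Episode 0 only moves the final step. Each later episode, the nonzero
# prediction extends one step further back, until prediction[0] is close to
# 1.0 -- the circuit now "expects the reward" a whole episode in advance.
```

So the "0.3, then 0.6, then 0.9 seconds ahead" creep is just the usual story of bootstrapped predictions propagating credit backwards in time, one lookahead interval per pass.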
Who cares if a cortex by itself is safe? A cortex by itself was never the plan!
Well, to be fair, I care a lot about whether a cortex by itself is safe, specifically because if so, the plan maybe should be to build a cortex (approximately) by itself, directed by control systems very different from those of biological brains, such as text prompts.
And as discussed above (and more in later posts), even if the researchers start trying in good faith to give their AGI an innate drive for being helpful / docile / whatever, they might find that they don’t know how to do so.
Feel free not to respond if this is answered in later posts, but how relevant is it to your model that current LLMs (which are not brain-like and not AGIs) are helpful and docile in the vast majority of contexts?
Is this evidence that would-be AGI developers actually do know how to make their AGIs helpful and docile? Or is it missing the point?
The "Singularity" claim assumes general intelligence
I'm not sure exactly how you're using the term "general intelligence", but why does the singularity assume that? Why can't an "instrumental intelligence" recursively self-improve and seize the universe's available resources in service of its goals?
but on our interpretation the orthogonality thesis says that one cannot consider this
The orthogonality thesis doesn't claim that agents can't consider various propositions. Agents can consider whatever propositions they like; that doesn't mean they'll be moved by them.
To be more specific, I think this is a bootstrapping issue—I think we need a curiosity drive early in training, but can probably turn it off eventually. Specifically, let’s say there’s an AGI that’s generally knowledgeable about the world and itself, and capable of getting things done, and right now it’s trying to invent a better solar cell. I claim it probably doesn’t need to feel an innate curiosity drive. Instead it may seek new information, and seek surprises, as if it were innately curious, because it has learned through experience that seeking those things tends to be an effective strategy for inventing a better solar cell. In other words, something like curiosity can be motivating as a means to an end, even if it’s not motivating as an end in itself—curiosity can be a learned metacognitive heuristic. See instrumental convergence. But that argument does not apply early in training, when the AGI starts from scratch, knowing nothing about the world or itself. Instead, early in training, I think we really need the Steering Subsystem to be holding the Learning Subsystem’s hand, and pointing it in the right directions, if we want AGI.
Presumably another strategy would be to start with an already trained model as the center of our learning subsystem, and a steering subsystem that points to concepts in that trained model?
Something like: you have an LLM-based agent that can take actions in a text-based game. There's some additional reward machinery that magically updates the weights of the LLM (based on simple heuristic evaluations of the text context of the game?). You could presumably(?) instantiate such an agent so that it has some goals out of the gate, instead of needing to reward curiosity? (A rough sketch of the loop I'm imagining is below.)
Perhaps this already strays too far from the human-setup to count as "brain-like."
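Concretely, the shape I have in mind is something like this (everything here is a hypothetical placeholder: heuristic_valence, run_episode, the llm_policy and game objects, and update_weights, which is exactly the "magic" part I'm hand-waving):

```python
def heuristic_valence(game_text: str) -> float:
    """Placeholder steering signal: crude keyword scoring of the transcript."""
    text = game_text.lower()
    return text.count("you found") - text.count("you have died")

def run_episode(llm_policy, game, update_weights, max_turns=50):
    transcript = game.reset()
    for _ in range(max_turns):
        action = llm_policy.generate(transcript)        # LLM proposes a text action
        transcript = game.step(action)                  # game returns the new text context
        reward = heuristic_valence(transcript)          # simple hardwired evaluation
        update_weights(llm_policy, transcript, reward)  # e.g. some RL-style weight update
    return transcript
```

If something like this works, the curiosity-bootstrapping phase is effectively replaced by pretraining: the policy already knows enough about the world that a crude steering signal can latch onto learned concepts from the start.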
Trying to solve philosophical problems like these on a deadline with intent to deploy them into AI is not a good plan, especially if you're planning to deploy it even if it's still highly controversial (i.e., a majority of professional philosophers think you are wrong).
If the majority of professional philosophers do endorse your metaethics, how seriously should you take that?
And conversely, do you think it's implausible that you could have correctly reasoned your way to the correct metaethics, as validated by a narrower community of philosophers, but not yet have convinced everyone in the field?
The Sequences often emphasize that most people in the world believe in God, so if you're interested in figuring out the truth, you gotta be comfortable confidently disclaiming widely held beliefs. What do you say to the person who assesses that academic philosophy is a sufficiently broken field, with warped incentives that prevent intellectual progress, and concludes that they should discard the opinion of the whole field?
Do you just claim that they're wrong about that, on the object level, and that hypothetical person should have more respect for the views of philosophers?
(That said, I'll observe that there's an important asymmetry in practice between "almost everyone is wrong in their belief in X, and I'm confident about that" and "I've independently reasoned my way to Y, and I'm very confident of it." Other people are wrong != I am right.)