All of trentbrick's Comments + Replies

This piece is super interesting, especially the toy models.

A few clarifying questions:

'And not just any learning algorithm! The neocortex runs a learning algorithm, but it's an algorithm that picks outputs to maximize reward, not outputs to anticipate matched supervisory signals. This needs to be its own separate brain module, I think.'

-- Why does it need to be its own separate module? Can you expand on this? And even if separate modules are useful (as per your toy models and different inputs), couldn't the neocortex also be running lookup table like au... (read more)

Steven Byrnes
Thanks for the great questions!! Maybe you're ahead of me, but it took me until long after this post—just a couple weeks ago—to realize that you can take a neural circuit set up for RL and jury-rig it to do supervised learning instead. I think this is a big part of the story behind what the vmPFC is doing. And, in a certain sense, the amygdala too. More on this in a forthcoming post.

I think of the neocortex as doing analysis-by-synthesis—it searches through a space of generative models for one that matches the input data. There's a lot of recurrence—the signals bounce around until the system settles into an equilibrium. For example, in this model, there's a feedforward pass from the input data, and that activates some generative models that seem plausible. But it may activate multiple models that are mutually inconsistent. For example, in the vision system, "Yarn" and "Yam" are sufficiently close that a feedforward pass would activate both possibilities simultaneously. Then there's a message-passing algorithm where the different possibilities compete to explain the data, and the system settles on one particular compositional generative model. So this is a generally pretty slow inference algorithm. But the slowness is worth it, because the neocortex winds up understanding the input data, i.e. fitting it into its structured model of the world, and hence it can now flexibly query it, make predictions, etc.

I think the cerebellum is much simpler than that, and closer to an actual lookup table, and hence presumably much faster. The cerebellum is also physically closer to the spinal cord, which reduces communication delays when reading proprioceptive nerves and commanding muscles.

That paper says "it has been the dominant working assumption in the field that the CS is transmitted to the cerebellar cortex via the mossy fibres (mf) and parallel fibres (pf) whereas information about the US is provided by climbing fibres (cf) originating in the inferior olive", which is what I said... (read more)
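[A minimal sketch of the "jury-rig an RL circuit into supervised learning" idea above. This is an illustration only, not anything from the comment itself: the Gaussian REINFORCE learner, the linear target_fn, and all hyperparameters are made up. The RL update rule is left untouched; only the reward is redefined as the negative error against a supervisory target, so the same machinery ends up doing supervised regression.]

```python
import numpy as np

# Sketch: a Gaussian REINFORCE "policy" whose reward is the negative error
# against a supervisory target. The RL update is unchanged; only the reward
# is jury-rigged, and the circuit ends up doing supervised regression.
rng = np.random.default_rng(0)
w = np.zeros(3)        # policy parameters: mean output = w @ x
sigma = 0.5            # fixed exploration noise
lr = 0.01
baseline = 0.0         # running reward baseline (variance reduction)

def target_fn(x):
    # Hypothetical "supervisory signal" the circuit should learn to match.
    return x @ np.array([1.0, -2.0, 0.5])

for step in range(20000):
    x = rng.normal(size=3)
    mean = w @ x
    action = mean + sigma * rng.normal()      # exploratory output
    reward = -abs(action - target_fn(x))      # reward := -|error|
    baseline += 0.01 * (reward - baseline)
    # REINFORCE: grad of log pi(action) w.r.t. w = (action - mean) / sigma**2 * x
    w += lr * (reward - baseline) * (action - mean) / sigma**2 * x

print(w)  # should end up near [1.0, -2.0, 0.5], i.e. the supervised mapping
```

[With any other reward signal the same loop is ordinary policy-gradient RL; the "supervision" lives entirely in how the reward is wired up.]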

Why does this approach only need to be implemented in neocortex-like AGIs? If we have a factored series of value functions in an RL agent, then shouldn't we be able to take the same approach? But I guess you are thinking that the basal ganglia learning algorithms already do this for us, so it is a convenient approach?

Side note: I found the distinction between confusion and conflict a bit... confusing! Is confusion here the agent updating a belief, while conflict is the agent deciding to take an action?

Steven Byrnes
Oh, I wasn't saying that. I wouldn't know either way; I haven't thought about it. RL is a very big and heterogeneous field, and I only know little bits and pieces of it. It's a lot easier to make a specific proposal that applies to a specific architecture—the architecture that I happen to be familiar with—than to try to make a more general proposal. So that's all I did.

What do you mean by "factored series of value functions"? If you're thinking of my other post, maybe that's possible, although it's not what I had in mind, because humans can feel conflicted, but my other post is talking about a mechanism that does not exist in the human brain.

Yeah, that's what I was going for.

Thanks for this post!

I agree with just about all of it (even though it paints a pretty bleak picture). It was useful to put all of these ideas about inner/outer alignment in one place, especially the diagrams.

Two quotes that stood out to me:

" "Nameless pattern in sensory input that you’ve never conceived of” is a case where something is in-domain for the reward function but (currently) out-of-domain for the value function. Conversely, there are things that are in-domain for your value function—so you can like or dislike them—but wildly out-of-domain for ... (read more)

Thanks a lot for your detailed reply and sorry for my slow response (I had to take some exams!).

Regarding terminal goals, the only compelling one I have come across is coherent extrapolated volition, as outlined in Superintelligence. But how to even program this into code is of course problematic, and I haven't followed the literature closely since then for rebuttals or better ideas.

I enjoyed your piece on Steered Optimizers, and think it has helped give me examples where the algorithmic design and inductive biases can play a part in how controllable our system is... (read more)

Steven Byrnes
Thanks for the gwern link!

I think the most popular alternatives to CEV are "Do what I, the programmer, want you to do", argued most prominently by Paul Christiano (cf. "Approval-directed agents"); variations on that (Stuart Russell's book talks about showing a person different ways that their future could unfold and having them pick their favorite); task-limited AGI ("just do this one specific thing without causing general mayhem"), which I believe Eliezer was advocating for solving before trying to make a CEV maximizer; and lots of ideas for systems that don't look like agents with goals (e.g. CAIS). A lot of these "kick the can down the road" and don't try to answer big questions about the future, on the theory that future people with AGI helpers will be in a better position to figure out subsequent steps forward.

Sure, and neither did Yann LeCun. I don't know whether a DNN would be more or less interpretable than a neocortex with the same information content. I think we desperately need a clearer vision for what "interpretability tools" would look like in both cases, such that they would scale all the way to AGI. I (currently) see no way around having interpretability be a big part of the solution.

Strong agree. I do think we have a suite of social instincts which are largely common between people and hard-coded by evolution. But the instincts don't add up to an internally consistent framework of morality.

I'm generally not assuming that we will run search processes that parallel what evolution did. I mean, maybe, but I just don't think it's that likely, and it's not the scenario I'm trying to think through. I think people are very good at figuring out algorithms based on their desired input-output relations and then coding them up, whereas evolution-like searches over learning algorithms are ridiculously computationally expensive and have little precedent. (E.g., we invented ConvNets; we didn't discover ConvNets by an evolutionary search.) Evolution has put l... (read more)

Hi Steve, thanks for all of your posts.

It is unclear to me how this investigation into brain-like AGI will aid in safety research.

Can you provide some examples of what discoveries would indicate that this is an AGI route that is very dangerous or safe?

Without having thought about this much, it seems to me like the control/alignment problem depends on the terminal goals we provide the AGI, rather than on the substrate and algorithms it runs to obtain AGI-level intelligence.

Steven Byrnes
Ooh, good questions!

I agree that if an AGI has terminal goals, they should be terminal goals that yield good results for the future. (And don't ask me what those are!) So that is indeed one aspect of the control/alignment problem. However, I'm not aware of any fleshed-out plan for AGI where you get to just write down its terminal goals, or indeed where the AGI necessarily even has terminal goals in the first place. (For example, what language would the terminal goals be written in?) Instead, the first step would be to figure out how to build the system such that it winds up having the goals that you want it to have. That part is called the "inner alignment problem"; it is still an open problem, and I would argue that it's a different open problem for each possible AGI algorithm architecture—since different algorithms can acquire / develop goals via different processes. (See here for excessive detail on that.)

Sure!

* One possible path to a (plausibly) good AGI future is to try to build AGIs that have a suite of moral intuitions similar to the one humans have. We want this to work really well, even in weird out-of-distribution situations (the future will be weird!), so we should ideally try to make this happen by using similar underlying algorithms to the ones humans use. Then they'll have the same inductive biases, etc. This especially would involve human social instincts. I don't really understand how human social instincts work (I do have vague ideas I'm excited to explore further! But I don't actually know). It may turn out to be the case that the algorithms supporting those instincts rely in a deep and inextricable way on having a human body and interacting with humans in a human world. That would be evidence that this potential path is not going to work. Likewise, maybe we'll find out that you fundamentally can't have, say, a sympathy instinct without also having, say, jealousy and self-interest. Again, that would be evidence that this potential path is not as promising as it i... (read more)

Thank you for the kind words and for flagging some terms to look out for in societal-change approaches.

Fair enough, but for it to be that powerful and to be used as part of our immune system, we may end up free of parasites because we are all dead xD.

Thanks for the informative comments. You make great points. I think the population structure of bats may have something to do with their unique immune response to these infections, but I definitely want to look at the bat immune system more.