"interpret our previous actions as a being attempting to do something, and then taking that something as our goal"
Wow, that's super not how I experience my value/goal setting. I mostly think of my previous self (especially >10 years ago) as highly misguided due to lacking key information that I now have. A lot of this is information I couldn't have expected previous me to have, so I don't blame 'previous me' for this, per se. I certainly don't try to align myself to an extrapolation of previous me's goals, though!
Whereas my 'values', the underlying drivers behind my goals, seem fairly constant throughout my life, and the main thing changing is my knowledge of the world. So my goals change, because my life situation and my understanding of the workings of the world change. But values changing? Subtly, slowly, perhaps. But mostly that just seems dangerous and bad to current me. Current me endorses current me's values! Or they wouldn't be my values!
the underlying drivers behind my goals seem fairly constant throughout my life
What are these specifically, and what type of thing are they? Were they there when you were born? Were they there "implicitly" but not "explicitly"? In what sense were they always there (since whenever you claim they started being there)?
Surely your instrumental goals change, and this is fine and is a result of learning, as you say. So when something changes, you say: "Ah, those weren't my values, those were instrumental goals." But how can you tell that there's something fixed that underlies or overarches all the changing stuff? What is it made of?
These are indeed the important questions!
My answers from introspection would say things like, "All my values are implicit; explicit labels are just me attempting to name a feeling. The ground truth is the feeling."
"Some have been with me for as long as I can remember, others seem to have developed over time, some changed over time."
My answers from neuroscience would be shaped like, "Well, we have these basic drives from our hypothalamus, brainstem, basal ganglia... and then our cortex tries to understand and predict these drives, and drives can change over time (especially with puberty, for instance). If we were to break down where a value comes from, it would have to be from some combination of these basic drives, cortical tendencies (e.g. vulnerability to optical illusions), and learned behavior."
"Genetics are responsible for a fetus developing a brain in the first place, and set a lot of parameters in our neural networks that can last a lifetime. Obviously, genetics has a large role to play in what values we start with and what values we develop over time."
My answers from reasoning about it abstractly would be something like, "If I could poll a lot of people at a lot of different ages, and analyze their introspective reports and their environmental circumstances and their life histories, then I could do analysis on what things change and what things stay the same."
"We can get clues about the difference between a value and an instrumental goal by telling people to consider a hypothetical scenario in which a fact X was true that isn't true in their current lives, and see how this changes their expectation of what their instrumental goals would be in that scenario. For example, when imagining a world where circumstances have changed such that money is no longer a valued economic token, I anticipate that I would have no desire for money in that world. Thus, I can infer that money is an instrumental goal."
Overall, I really feel uncertain about the truth of the matter and the validity of each of these ways of measuring. I think understanding values vs instrumental goals is important work that needs doing, and I think we need to consider all these paths to understanding unless we figure out a way to rule some out.
If we were to break down where a value comes from, it would have to be from some combination of these basic drives, cortical tendencies (e.g. vulnerability to optical illusions), and learned behavior.
I wouldn't want to say this is false, but I'd want to say that speaking like this is a red flag that we haven't understood what values are in the appropriate basis. We can name some dimensions (the ones you list, and others), but then our values are rotated with respect to this basis; our values are some vector that cuts across these basis vectors. We lack the relevant concepts. When you say that you experience "the underlying drivers behind your goals" as being constant, I'm skeptical, not because I don't think there's something that's fairly fixed, but because we lack the concepts to describe that fixed thing, and so it's hard to see how you could have a clear experience of the fixedness. At most you could have a vague sense that perhaps there is something fixed. And if so, then I'd want to take that sense as a pointer toward the as-yet-not-understood ideas.
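To make the basis metaphor a bit more concrete (a toy rendering of my own, not a claim about the actual structure): if $e_1, e_2, e_3$ stand for "basic drives", "cortical tendencies", and "learned behavior", the worry is that any given value $v$ is some vector like

$$ v = \alpha_1 e_1 + \alpha_2 e_2 + \alpha_3 e_3, $$

with no single $\alpha_i$ dominating, so $v$ doesn't line up with any axis we currently know how to name; the basis in which $v$ would look simple is exactly the thing we're missing.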
Yes, I think I'd go with the description: 'a vague sense that there is something fixed, and a lived experience that says that if not completely fixed then certainly slow-moving.'
And I absolutely agree that our understanding of this is lacking.
"All my values are implicit, explicit labels are just me attempting to name a feeling. The ground truth is the feeling."
Can you elaborate? (I don't have a specific question, just double-clicking, asking for more detail or rephrasing that uses other concepts.)
I find that I'm not very uncertain about where values come from. Although the exact details of the mechanisms in complex systems like humans remain murky, to me it seems pretty clear that we already have the answer from cybernetics: they're the result of how we're "wired" up into feedback loops. There's perhaps the physics question of why the universe is full of feedback loops (and the metaphysics question of why our universe has the physics it has), but given that the universe is full of feedback loops, values seem a natural consequence of this fact.
I somewhat agree, though I think a lack of confusion (already feeling like one knows) can go too far as well. I wanted to add that, per Quine, and as later analyzed by Nozick, we seem to be homeostatic envelope extenders. That is, we start with maintaining homeostasis (which gets complex due to cognitive-arms-race, social-species dynamics), and then we add the ability to reason about things far from their original context in time and space, trying to extend our homeostatic abilities to new contexts and increase their robustness over arbitrary timelines, locations, and conditions.
This is a non-answer, and I wish you'd notice on your own that it's a non-answer. From the dialogue:
Really I want to know the shape of values as they sit in a mind. I want to know that because I want to make a mind that has weird-shaped values. Namely, Corrigibility.
So, given that you know where values come from, do you know what it looks like to have a deeply corrigible strong mind, clearly enough to make one? I don't think so, but please correct me if you do. Assuming you don't, I suggest that understanding what values are and where they come from in a more joint-carving way might help.
In other words, saying that, details aside, values come as "the result of how we're 'wired' up into feedback loops" is true enough, but not an answer. It would be like saying "our plans are the result of how our neurons fire" or "the Linux operating system is the result of how electrons move through the wires in my computer". It's not false; it's just not an answer to the question we were asking.
So, given that you know where values come from, do you know what it looks like to have a deeply corrigible strong mind, clearly enough to make one? I don't think so, but please correct me if you do. Assuming you don't, I suggest that understanding what values are and where they come from in a more joint-carving way might help.
Yes, understanding values better would be better. The case I've made elsewhere is that we can use cybernetics as the basis for this understanding. Hence my comment: if you don't know where values come from, I can offer what I believe is a model of where values ultimately come from, one that gives a good basis for building up a more detailed model of values. Others are doing the same with compatible models, e.g. predictive processing.
I've not thought deeply about corrigibility recently, but my thinking on outer alignment more generally has been that, because Goodhart is robust, we cannot hope to get fully aligned AI by any method that relies on measurement, which leaves us with building AI whose goals are already aligned with ours (it seems quite likely we're going to bootstrap via AI that helps us build this, though, so work on imperfect systems seems worthwhile, but I'll ignore that here). I expect a similar situation for building just corrigibility.
So to build a corrigible AI, my model says we need to find the configuration of negative feedback circuits that implements a corrigible process. That doesn't constrain the search space a lot, but it does constrain it some, and it makes clear that what we face is an engineering challenge rather than a theory challenge. I see this as advancing the question from "where do values come from?" to "how do I build a thing out of feedback circuits that has the values I want it to have?".
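To gesture at what I mean by "a thing built out of feedback circuits," here's a minimal toy sketch (entirely my own; the function name, gain, and numbers are hypothetical and purely illustrative): a single negative-feedback loop that maintains a setpoint, which on the cybernetic picture is about the simplest thing you could call a "value".

```python
# Toy sketch of a single negative-feedback circuit (purely illustrative).
# On the cybernetic picture, the system "values" its setpoint in the sense
# that it acts so as to shrink any deviation from it.

def run_homeostat(setpoint: float, state: float, gain: float = 0.5, steps: int = 20) -> float:
    """Push `state` toward `setpoint` via proportional negative feedback."""
    for _ in range(steps):
        error = setpoint - state   # compare the sensed state to the reference
        state += gain * error      # act to reduce the error
    return state

# The loop settles on its setpoint regardless of where it starts:
print(run_homeostat(setpoint=37.0, state=20.0))  # ~37.0
print(run_homeostat(setpoint=37.0, state=50.0))  # ~37.0
```

The open engineering question above is then which configuration of loops like this (many of them, coupled, with references that can themselves be set by other loops) would implement a corrigible process rather than a fixed setpoint.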