An analogy that points at one way I think the instrumental/terminal goal distinction is confused:
Imagine trying to classify genes as either instrumentally or terminally valuable from the perspective of evolution. Instrumental genes encode traits that help an organism reproduce. Terminal genes, by contrast, are the "payload": the genes being passed down the generations for their own sake.
This model might seem silly, but it actually makes a bunch of useful predictions. Pick some set of genes which are so crucial for survival that they're seldom if ever modified (e.g. the genes for chlorophyll in plants, or genes for ATP production in animals). Treating those genes as "terminal" lets you "predict" that other genes will gradually evolve in whichever ways help most to pass those terminal genes on, which is what we in fact see.
But of course there's no such thing as "terminal genes". What's actually going on is that some genes evolved first, meaning that a bunch of downstream genes ended up selected for compatibility with them. In principle evolution would be fine with the terminal genes being replaced, it's just that it's computationally difficult to find a way to do so without breaking downstream dependencies.
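To make the "terminal-looking genes" dynamic concrete, here's a toy selection model I sketched (the fitness function and all parameters are made up for illustration, and the downstream genes' dependence on the core is hard-coded rather than emergent, so it only illustrates the resulting asymmetry, not how it arises): a few "core" genes gate viability entirely, and the remaining genes only add fitness when they're compatible with the core.

```python
# A toy selection model (my own sketch; fitness function and parameters are made up
# for illustration). A few "core" genes gate viability entirely, and the remaining
# genes only add fitness when they "fit" the core (here: when they are also 1).
# The dependence on the core is baked into the fitness function, so this only shows
# the observable asymmetry: the core genes look "terminal" because everything
# downstream gets selected for compatibility with them.
import random

GENOME_LEN = 20
CORE = range(0, 3)                 # indices of the "core" genes
DOWNSTREAM = range(3, GENOME_LEN)  # everything else
POP_SIZE = 200
MUTATION_RATE = 0.01               # per-gene flip probability per generation
GENERATIONS = 300

def fitness(genome):
    # An organism with a broken core gene is unviable.
    if not all(genome[i] for i in CORE):
        return 0.0
    # Downstream genes contribute fitness only insofar as they match the core.
    return 1.0 + sum(genome[i] for i in DOWNSTREAM)

def mutate(genome):
    return [g ^ (random.random() < MUTATION_RATE) for g in genome]

def step(population):
    weights = [fitness(g) for g in population]
    parents = random.choices(population, weights=weights, k=POP_SIZE)
    return [mutate(p) for p in parents]

random.seed(0)
# Start with an intact core and random downstream genes.
population = [[1] * len(CORE) + [random.randint(0, 1) for _ in DOWNSTREAM]
              for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    population = step(population)

core_intact = sum(all(g[i] for i in CORE) for g in population) / POP_SIZE
downstream_ones = (sum(sum(g[i] for i in DOWNSTREAM) for g in population)
                   / (POP_SIZE * len(DOWNSTREAM)))
print(f"fraction with intact core genes: {core_intact:.2f}")
print(f"downstream compatibility (freq. of 1s): {downstream_ones:.2f}")
```

With these numbers I'd expect the core to stay near-universally intact (broken only by fresh mutations that selection immediately removes), while the downstream frequency of 1s climbs well above its 0.5 starting point—even though nothing in the model labels the core genes as special.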
I think this is a good analogy for how human values work. We start off with some early values, and then develop instrumental strategies for achieving them. Those instrumental strategies become crystallized and then give rise to other instrumental strategies for achieving them, and so on. Understood this way, we can describe an organism's goals/strategies purely in terms of which goals "have power over" which other goals, which goals are most easily replaced, etc., without needing to appeal to some kind of essential "terminalism" that some goals have and others don't. (Indeed, the main reason you'd need that concept is to describe someone who has modified their goals towards having a sharper instrumental/terminal distinction—i.e. it's a self-fulfilling prophecy.)
That's the descriptive view. But "from the inside" we still want to know which goals we should pursue, and how to resolve disagreements between our goals. How to figure that out without labeling some goals as terminal and others as instrumental? I don't yet have a formal answer, but my current informal answer is that there's a lot of room for positive-sum trade between goals, and so you should set up a system which maximizes the ability of those goals to cooperate with each other, especially by developing new "compromise" goals that capture the most important parts of each.
This leads to a pretty different view of the world from the Bostromian one. It often feels like the Bostrom paradigm implicitly divides the future into two phases. There's the instrumental phase, during which your decisions are dominated by trying to improve your long-term ability to achieve your goals. And there's the terminal phase, during which you "cash out" your resources into whatever you value. This isn't a *necessary* implication of the instrumental/terminal distinction, but I expect it's an emergent consequence of taking that distinction seriously across a range of environments. E.g. in our universe it sure seems like any scale-sensitive value system should optimize purely for number of galaxies owned for a long time before trying to turn those galaxies into paperclips/hedonium/etc.
From the alternative perspective I've outlined above, though, the process of instrumentally growing and gaining resources is also simultaneously the process of constructing values. In other words, we start off with underspecified values as children, but then over time choose to develop them in ways which are instrumentally useful. This process leads to the emergence of new, rich, nuanced goals which satisfy our original goals while also going far beyond them, just as the development of complex multicellular organisms helps to propagate the original bacterial genes for chlorophyll or ATP—not by "maximizing" for those "terminal" genes, but by building larger creatures much more strange and wonderful.
One thing I had in an earlier draft of this shortform: the concept of "brain waves" makes me suspect that the timings of neural spikes are also best understood as discrete. But I don't know enough about how brain waves actually work (or what they even are) to say anything substantive here.
Yes! This is an important point. I don't quite know how to cash it out yet, but I suspect I will eventually converge towards viewing concepts as "agents" which are trying to explain as much sensory data as possible while also cooperating/competing with each other.
A neural spike either happens or it doesn't; you don't get partial spikes.
An analogy that might be banal, but might be interesting:
One reason (the main reason?) that computers use discrete encodings is to make error correction easier. A continuous signal will gradually drift over time, whereas a signal that is frequently rounded to the nearest discrete value might remain error-free for a long time. (I think this is also the reason why the two most complicated biological information-processing systems use discrete encodings: DNA base pairs and neural spikes. EDIT: Neural spikes may seem continuous in the time dimension, but the concept of "brain waves" makes me suspect that the time intervals between them are better understood as discrete.)
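Here's a minimal sketch of the drift-vs-rounding point (my own toy example; the two-level alphabet and noise scale are arbitrary choices): two copies of the same stored value receive the same small perturbation at every step, but only one of them gets snapped back to the nearest allowed level.

```python
# Toy illustration of why discretization helps with error correction: a continuous
# copy of a value accumulates noise as a random walk, while a copy that is rounded
# to the nearest discrete level after every step stays exactly where it started
# (as long as per-step noise is below half the spacing between levels).
import random

LEVELS = [0.0, 1.0]          # a 1-bit "alphabet"
NOISE = 0.02                 # per-step noise magnitude
STEPS = 10_000

def snap(x):
    # Round to the nearest allowed discrete level.
    return min(LEVELS, key=lambda level: abs(x - level))

random.seed(0)
continuous = discrete = 1.0  # both start as the "1" symbol
for _ in range(STEPS):
    noise = random.uniform(-NOISE, NOISE)
    continuous += noise
    discrete = snap(discrete + noise)

print(f"continuous copy: {continuous:.3f}")  # has random-walked; no longer reliably 1.0
print(f"discretized copy: {discrete:.1f}")   # still exactly 1.0
```

The snapping corrects every perturbation completely because the noise never pushes the value past the halfway point between levels; the continuous copy has no such threshold, so its errors just accumulate.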
Separately, agents tend to define discrete boundaries around themselves—e.g. countries try to have sharp borders rather than fuzzy borders. One reason (the main reason?) is to make themselves easier to defend: with sharp borders there's a clear Schelling point for when to attack invaders. Without that, invaders might "drift in" over time.
The logistics of defending oneself vary by type of agent. For physical agents, perhaps fuzzy boundaries are just not possible to implement (e.g. we humans need to literally hold the water inside us). However, many human groups (e.g. social classes) have initiation rituals which clearly demarcate who's in and who's out, even though in principle it'd be fairly easy for them to use a gradual/continuous metric of membership (like how many "points" members have accumulated). We might be able to explain this as a way of giving them social defensibility.
Nice work. I would also be excited about someone running with a similar project but for de-censoring Western models (e.g. on some of the topics discussed in this curriculum).
Worth noting that "Only goal: get cryogenically preserved into the glorious transhumanist singularity" is a pretty sociopathic way to orient to the world.
But it's a fun premise and also more manageable/approachable than trying to write about steering civilization as a whole.
I didn't pay this post much attention when it came out. But rereading it now I find many parts of it insightful, including the description of streetlighting, the identification of the EA recruiting pipeline as an issue, and the "flinching away" model. And of course it's a "big if true" post, because it's very important for the field to be healthy.
I'm giving it +4 instead of +9 because I think that there's something implicitly backchainy about John's frame (you need to confront the problem without flinching away from it). But I also think you can do great alignment work by following your curiosity and research taste, if those are well-developed enough, without directly trying to "solve the problem". And so even the identification of alignment as a field aimed at solving a given big problem, rather than as a field aimed at developing a deep scientific understanding, is somewhat counterproductive IMO.
I have referred back to this post a lot since writing it. I still think it's underrated, because without understanding what we mean by "alignment research" it's easy to get all sorts of confused about what the field is trying to do.
No more than hiring new employees is purely negative for existing employees at a company.
The premise I'm working with here is that you can't create goals without making them "terminal" in some sense (just as you can't hire employees without giving them some influence over company culture).