Mostly the newer posts that you're reading are not aiming to come up with The One True Encoding Of Human Values, which is why people don't talk about these problems in relation to them. Rather, the hope is to create an AI system that does the specific things we ask of it, but ensures that we remain in control (see discussion of "corrigibility"). Such an AI system need not know The One True Encoding Of Human Values,
I don't understand why anybody would want anything that involved leaving humans in control, unless there were absolutely no alternative whatsoever.
I'm not joking or being hyperbolic; I genuinely don't get it. A lot of people seem to think that humans being in control is obviously good, but it seems really, really obvious to me that it's a likely path to horrible outcomes.
Humans haven't had access to all that much power for all that long, and we've already managed to create a number of conditions that look unstable and likely to go bad in catastrophic ways.
We're on a climate slide to who-knows-where. The rest of the environment isn't looking that good either. We've managed to avoid large-scale nuclear war for like 75 whole years after developing the capability, but that's not remotely long enough to call "stable". Those same 75 years have seen some reduction in war in general, but that looks like it's turning around as the political system evolves. Most human governments (and other institutions) are distinctly suboptimal on a bunch of axes, including willingness to take crazy risks, and, although you can argue that they've gotten better in maybe the last 100 to 150 years, a large number of them now seem to have stopped getting better and started getting worse. Humans in general are systematically rotten to each other, and most of the advancement we've made against that seems to come from probably unsustainable institutional tricks that limit anybody's ability to get the decisive upper hand.
If you gave humans control over more power, then why wouldn't you expect all of that to get even worse? And even if you could find a way to make such a situation stably not-so-bad, how would you manage the transition, where some humans would have more power than others, and all humans, including the currently advantaged ones, would feel threatened?
It seems to me that the obvious assumption is that humans being in control is bad. And trying to think out the mechanics of actual scenarios hasn't done anything to change that belief. How can anybody believe otherwise?
There's a difference between "AI putting humans in control is bad", and "AI putting humans in control is better than other options we seem to have for alignment." For many people, it may be as you mentioned:
I don't understand why anybody would want anything that involved leaving humans in control, unless there were absolutely no alternative whatsoever.
(I'm somewhat less pessimistic than you are, I think, but I agree it could go pretty damn poorly, for many ways the AI could "leave us in control.")
I don't have an alternative, and no, I'm not very happy about that. I definitely don't know how to build a friendly AI. But, on the other hand, I don't see how "corrigibility" could work either, so in that sense they're on an equal footing. Nobody seems to have any real idea how to achieve either one, so why would you want to emphasize the one that seems less likely to lead to a non-sucky world?
Anyway, what I'm reacting to is this sense I get that some people assume that keeping humans in charge is good, and that humans not being in charge is in itself an unacceptable outcome, or at least weighs very heavily against the desirability of an outcome. I don't know if I've seen very many people say that, but I see lots of things that seem to assume it. Things people write seem to start out with "If we want to make sure humans are still in charge, then...", like that's the primary goal. And I do not think it should be a primary goal. Not even a goal at all, actually.
Nobody seems to have any real idea how to achieve either one
I think that's not true and we in fact have a much better idea of how to achieve corrigibility / intent alignment. (Not going to defend that here. You could see my comment here, though that one only argues why it might be easier rather than providing a method.)
Others will disagree with me on this.
humans not being in charge is in itself an unacceptable outcome, or at least weighs very heavily against the desirability of an outcome
The usual argument I'd give is "if humans aren't in charge, then we can't course correct if something goes wrong". It's instrumental, not terminal. If we ended up in a world like this where humans were not in charge, that seems like it could be okay depending on the details.
Another possibility is the Posthuman Technocapital Singularity: everything goes in the same approximate direction, there are a lot of competing agents but without sharp destabilization or power concentration, and Moloch wins. Probably wins, idk
https://docs.osmarks.net/hypha/posthuman_technocapital_singularity
I consider the Arbital article on CEV the best reference for the topic. It says:
CEV is rather complicated and meta and hence not intended as something you'd do with the first AI you ever tried to build. CEV might be something that everyone inside a project agreed was an acceptable mutual target for their second AI. (The first AI should probably be a Task AGI.)
So MIRI doesn't focus on CEV, etc. because the world hasn't nailed down step one yet. We're extremely worried that humanity's on track to fail step one; and it doesn't matter how well we do on step two if we don't pull off step one. That doesn't mean that stopping at step one and never shooting for anything more ambitious would be acceptable; by default I'd consider that an existential catastrophe in its own right.
Yeah, CEV itself seemed like a long shot, but my thought process was that maintaining human control wouldn't be enough for step one, both because I think it's not enough at the limit, and because the human component might inherently be a limiting factor that makes it not very competitive. But the more I thought about it, the weaker that assumption of inherent-ness seemed, so I agree that the most this post could be saying is that the timeline gap between something like Task AGI and figuring out step two is short, which I expect isn't very groundbreaking.
Also, there's no proof that CEV would work. Maybe values are incoherent.
The Arbital article is no help.
Asking what everyone would want* if they knew what the AI knew, and doing what they’d all predictably agree on, is just about the least jerky thing you can do.
How do we know that they would agree? That just begs the question. Saying that you shouldn't be "jerky", i.e. selfish, doesn't tell you what kind of unselfishness to have instead. Clearly, the left and the right don't agree on the best kind of altruism -- laying down your life to stop the spread of socialism, versus sacrificing your income to implement socialism.
So if I'm understanding it correctly, it's that maintaining human control is the best option we can formally work toward? The One True Encoding of Human Values would most likely be a more reliable system if we could create it, but that it's a much harder problem, and not strictly necessary for a good end outcome?
The One True Encoding of Human Values would most likely be a more reliable system if we could create it
My best guess is that this is a confused claim and I have trouble saying "yes" or "no" to it, but I do agree with the spirit of it.
but that it's a much harder problem, and not strictly necessary for a good end outcome?
Yes, in the same way that if you're worried about mosquitoes giving you malaria, you start by getting a mosquito net or moving to a place without mosquitoes, you don't immediately try to kill all mosquitoes in the world.
This is a totally fine thing to post :) I agree with most of the things, and agree about their importance.
I think the Goodhart's law side of CEV is more subtle. To be pithy, CEV doesn't have a problem with Goodhart's law yet because it's not specific enough to even get noticed by Goodhart. If CEV hypothetically considers doing something bad, we can just reassure ourselves that surely that's not what our ideal advisor would have wanted. It's only once we pick a specific method of implementation that we have to confront in mechanistic detail what we could previously hide under the abstraction of anthropomorphic agency.
It's only once we pick a specific method of implementation that we have to confront in mechanistic detail what we could previously hide under the abstraction of anthropomorphic agency.
I agree. I was trying to think of possible implementation methods, dropping various constraints like computing power or competitiveness as it became harder to find any, and the final sticking point was still Goodhart's Law. For the most part, I kept it in to give an example of the difficulty of meta-alignment (corrigibility in favourable directions).
Brute forcing extended high-fidelity simulations of all the humans that have ever lived in an attempt to formulate CEV will probably be too expensive for any first-generation AGI.
Prediction 1: that will never be a possibility, period, not just for a "first-generation" anything, but all the way out to the omega point. Not if you want the simulation to have enough fidelity to be useful for any practical purpose, or even interesting.
It probably won't even be possible, let alone cost effective, to do that for one person, since you'd have to faithfully simulate how the environment would interact with that person. Any set of ideas that relies on simulations like that is going to end up being useless.
I agree that this is probably true, but I wouldn't put it at >90% probability for far-future AI. With computing power greater than Jupiter brains, it probably still wouldn't be practical, but my point in thinking about it was that if it were possible to brute-force for a first-generation AGI, then there's a chance for more efficient ways.
I’m Jose. I realized recently I wasn’t taking existential risk seriously enough, and in April, a year after I first applied, I started running a MIRIx group at my college. I’ll write summaries of the sessions that I thought were worth sharing. Most of the members are very new to FAI, so this will be partly an incentive to push upward and partly my own review process. Hopefully some of this will be helpful to others.
This one focuses on how aligning creator intent with the base objective of an AI might not be enough for outer alignment, starting with an overview of Coherent Extrapolated Volition and its flaws. This was created in collaboration with Jacob Abraham and Abraham Francis.
Coherent Extrapolated Volition
From the wiki,
In other words, CEV is an AI having not only a precise model of human values, but also the meta-level understanding of how to resolve contradictions and incompleteness in those values in a friendly way.
Eliezer considered this line of research obsolete due to the problems it runs into - some of which make it appear that the proposal itself only shifts the pointers to the goals. In the time we spent discussing it, we ended up with a (most likely not comprehensive) list of the major flaws of CEV.
Criticism
However, CEV can be viewed as a proposal of what to do with AGI after problems like Goodhart’s Law have been solved.
Sobel considers two objections to the amnesia model:
This particular objection seems unlikely to me, but not obviously implausible.
I began with CEV because it specifically targets the possibility that human values themselves might not be enough.
Inner and Outer Alignment
There is little consensus on a definition of the entire alignment problem, but a large part of it, intent alignment, i.e. making sure the AI does what its programmers want it to do, is composed of two components: inner and outer alignment.
Inner Alignment is about making sure the AI actually optimizes what our reward function specifies. The reward function is the base objective: the objective the base optimizer searches for learned optimizers to implement. But that search may find proxy objectives that are easier to optimize and still do the job fairly well; these proxies make up the mesa objective (think of evolution, where the base objective is reproductive fitness, while the mesa objective includes heuristics like pain aversion, status signaling, etc.). Inner Alignment is aligning the mesa objective with the base objective.
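To make the base/mesa distinction concrete, here's a toy sketch (my own illustration, not from the thread; all names are invented): a hill-climber given the true base objective converges on the intended target, while one handed a cheaper proxy objective "Goodharts" off in a single direction and scores terribly on the true objective.

```python
def base_objective(x, target=(1.0, 1.0, 1.0)):
    """True reward: negative squared distance to the full target."""
    return -sum((xi - ti) ** 2 for xi, ti in zip(x, target))

def mesa_objective(x):
    """Proxy reward: only the first coordinate matters (easier to optimize)."""
    return x[0]

def hill_climb(objective, x, step=0.5, iters=20):
    """Greedy coordinate ascent on the given objective."""
    x = list(x)
    for _ in range(iters):
        for i in range(len(x)):
            for delta in (step, -step):
                trial = x[:]
                trial[i] += delta
                if objective(trial) > objective(x):
                    x = trial
    return x

start = [0.0, 0.0, 0.0]
aligned = hill_climb(base_objective, start)  # converges to (1, 1, 1)
proxy = hill_climb(mesa_objective, start)    # runs off along dimension 0

# Optimizing the proxy scores worse on the base objective than doing nothing:
assert base_objective(aligned) > base_objective(start)
assert base_objective(proxy) < base_objective(start)
```

The proxy agrees with the base objective early on (moving from 0 toward 1 on the first coordinate improves both), which is exactly why the search can select it; the divergence only shows up when the proxy is pushed far past the region where it was a good heuristic.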
Outer Alignment is about making sure we like the reward function we’re training the AI for. That is, if we had a model that solved inner alignment and was actually optimizing for the objective it was given, would we like that model? This is the centre of much of classical alignment discussion (the paperclip AI thought experiment, for example).
Recall that what CEV addresses is the possibility that aligning our intent with the base objective is insufficient: a model that optimizes an objective we like can still fail in the limit as it runs into inconsistencies or other problems with our value systems. Resolving these problems in a friendly way may be beyond a base human or human model at test time; far from necessarily so, but I think with enough probability in at least a few instances to be a problem.
Insufficient Values
Note: Epistemic status on the following is speculative at best, and is based on what posts and papers we could read in the time we had.
From my limited understanding of Outer Alignment, it doesn’t include a formalization of aligning AI with the values we would hold at the limit. Some of the proposals we looked at also ran into this problem.
Imitative amplification, for example, relies on a model that tries to imitate a human with access to the model. With oversight using transparency tools to account for deceptive or otherwise harmful behaviour, it is plausibly outer aligned. However, a base human may not be able to reliably resolve, in a friendly way, the contradictions and inconsistencies they would face at the limit. That’s fairly uncharted territory, and might involve the human model diverging too far from the human template. I don’t think the oversight would be of much help here either, because these problems need not come up as early as training time. It’s also possible that nearly any sort of resolution would seem misaligned to us.
Some proposals bypass this problem altogether, but at terrible cost. STEM AI, for example, avoids value modelling entirely, but does so by ignoring the class of use cases where those would be relevant.
It’s possible that we wouldn’t need to worry about this problem at all. Perhaps these issues will be addressed during training time, or instead of resolving inconsistencies, the AI could accept them as new value axioms. But while the former may even be the likely scenario, the alternative still holds a distinct probability, especially in realistic scenarios where training until hypothetical future value conflicts are resolved isn’t competitive. And treating inconsistencies as new axioms could be dangerous in its own right, and might not even solve the core problem, since adding new axioms could spawn an endless chain of new inconsistencies, in Gödelian fashion.
Endnote: I hesitated for a while before posting this because it felt like something that must have been addressed already. I didn’t find much commenting on this in any of the posts we went through though, so I just peppered this with what was possibly an irksome number of uncertainty qualifiers. Whatever we got wrong, tell us.