I never came back to this paper after I briefly posted about it, and this seems as good a place as any to say more and continue the conversation.
What I found weird about this paper is that it focuses heavily on something that seems largely irrelevant to me. I don't expect there to be much for us to choose about how to aggregate values, because I expect that most of the problem is in figuring out how to specify or find values at all. I do expect there to be some issues to resolve around aggregation, but not yet knowing what we will be aggregating (that is, what abstractions we will be performing aggregation and conflict resolution over) makes it hard to see how this kind of consideration is relevant yet.
To be fair, many may object that I am making the same mistake by worrying about what values even are, and how we might verify whether AI are aligned with ours, when we don't even know what AI powerful enough to need alignment will look like. So I wouldn't want to see this kind of work not happen; it's just that, for my taste, it seems premature to worry about, and may end up reasoning about things that won't be relevant, or won't be relevant in the way we expect, such that the work is of limited marginal value.
That said, I think this paper stands as an excellent signal, as you do, that more mainstream AI researchers are taking problems in value alignment more seriously and are thinking about the kinds of problems that are, in my estimation, more likely to be important in the long term than short-term concerns about, for example, narrow value learning.
It sure seems like if he really grokked the philosophical and technical challenge of getting an AGI agent to be net beneficial, he would write a different paper. That first challenge sort of overshadows the task of dividing up the post-singularity pie.
But I'm not sure whether the overshadowing is merely by being bigger (in which case this paper is still doing useful work), or if we should expect that solutions to the pie-dividing problems (e.g. weighing egalitarianism vs. utilitarianism) will necessarily fall out of the process that lets the AI learn how to behave well.
If you buy a pizza cutter, but the pizza doesn't arrive, then you've wasted your money.
(Technically this is incorrect if you ever buy a pizza again, or there's something else you can use it to split, but I understand the main reason people have expressed concern about AGI is the belief that if it goes horribly wrong there won't be another chance to try again.)
Short:
I agree with MichaelA's questions about the paper.
Long:
Responses to quotes from the paper:
Furthermore, even if this were not the case and we came to have great confidence in the truth of a single moral theory, the proposed approach immediately encounters a second problem, namely that there would still be no way of reliably communicating this truth to others.
This seems incorrect - if we don't have "the one true theory" (assuming it exists), then how do we know it can't be reliably communicated? Though this may be close to hitting the nail on the head:
Designing AI in accordance with a single moral doctrine would therefore involve imposing a set of values and judgments on other people who did not agree with them
Unless the correct moral theory doesn't involve doing that? It seems like this is just changing the name of the search.
In the absence of moral agreement, is there a fair way to decide what principles AI should align with?
It's not clear what "fair" means here - but the paper might be about looking for "morality"/achieving one of its properties under a different name, as noted above.
This seems incorrect - if we don't have "the one true theory" (assuming it exists), then how do we know it can't be reliably communicated?
To be fair to the paper, I'm not sure that that specifically is as strong an argument as it might look. E.g., I don't have a proof for [some as-yet-unproven mathematical conjecture], but I feel pretty confident that if I did come up with such a proof, I wouldn't be able to reliably communicate it to just any given random person.
But note that there I'm saying "I feel pretty confident", and "I wouldn't be able to". So I think the issue is more with the "can't", and the implication that we couldn't fix that "can't" even if we tried, rather than with the fact that these arguments are being applied to something we haven't discovered yet.
That said, I do think it's an interesting and valid point that the fact we haven't found that theory yet (again, assuming it exists) adds at least a small extra reason to believe it's possible we could communicate it reliably. For example, my second-hand impression is that some philosophers think "the true moral theory" would be self-evidently true, once discovered, and would be intrinsically motivating, or something like that. That seems quite unlikely to me, and I wouldn't want to rely on it at all, but I guess it is yet another reason why it's possible the theory could be reliably communicated.
And I guess even if the theory was not quite "self-evidently true" or "intrinsically motivating", it might still be shockingly simple, intuitive, and appealing, making it easier to reliably communicate than we'd otherwise expect.
Perhaps given that we don't know that it can be reliably communicated, we shouldn't rely on that.
Yes, I'd strongly agree with that. I sort-of want us to make as few assumptions on philosophical matters as possible, though I'm not really sure precisely what that means or what that looks like.
"Designing AI in accordance with a single moral doctrine would therefore involve imposing a set of values and judgments on other people who did not agree with them"
Unless the correct moral theory doesn't involve doing that?
To again be fair to the paper, I believe the argument is that, given the assumption (which I contest) that we definitely couldn't reliably convince everyone of the "correct moral theory", if we wanted to align an AI with that theory we'd effectively end up imposing that theory on people who didn't sign up for it.
You might have been suggesting that such an imposition might be explicitly prohibited by the correct moral theory, or something like that. But in that case, I think the problem is instead that we wouldn't be able to align the AI with that theory, without at least some contradictions, if people couldn't be convinced of the theory (which, again, I don't see as certain).
In January, DeepMind released a paper by Iason Gabriel called Artificial Intelligence, Values, and Alignment (author’s summary here; Rohin Shah's summary here). Here’s the abstract:
I found this an interesting paper, and overall I think that I agreed with a lot of it, and that I’d recommend it to people interested in the topic. My main hesitation would be that people who already know a lot about these topics might find most of what the paper says familiar. But I’d guess that even such people would probably still learn something, and that they might benefit from how the paper packages and explains the things they were already familiar with.
It also felt promising to see a paper released by one of the leading AI labs close with:
But there were a few things in the paper that I felt a bit unsure about, and two passages in particular that I want to quibble with. I’ll first quote the whole first passage and give my high-level views on it, for context, and then I'll get into specifics on that passage. I'll then quote and critique the second passage. (Note that my quibbles/critiques could be mistaken, and that I’d be interested in counterarguments people might have.)
The first passage
The passage in question supports the third proposition mentioned above, and goes as follows:
(Note: See here, and its comments, for more on “the long reflection”.)
Some of the claims in that passage are things I just outright agree with. And I think I’d outright agree with the whole passage if:
But as it was, the passage felt to me like it moved very fast and very confidently, in a way that gave me a sort of whiplash.
With that context in mind, I’ll now get into what I saw as the specific issues.
My first set of quibbles
I have no issue with the following claims, in themselves:
But I don’t see how those claims at all support the broader point being made: i.e., that “the task in front of us is not, as we might first think, to identify the true or correct moral theory and then implement it in machines”, and/or that we would not be able to come “to have great confidence in the truth of a single moral theory”.
Instead, it seems to me that we could believe that:
...and yet:
That is, perhaps there’s a true/correct moral theory that’s substantially different to any theory we currently know of, but that could be found by an excellently implemented process of “long reflection”, and that would be an excellent thing to align our AI systems with.
To be clear, I’m not saying we necessarily should believe the above three claims. And I’m certainly not saying that we should take confident, simple versions of them as assumptions when building AGI. (Personally, I do believe all of the above three claims, but with heavy, heavy emphasis on the “might” and the “perhaps”s, and so I’d want our principles for an AGI to be open to those possibilities but definitely not to rely on them.)
What I’m saying is just that I don’t at all see how the specific premise that “it is very unlikely that any single moral theory we can now point to captures the entire truth about morality” supports the conclusion that “the task in front of us is not, as we might first think, to identify the true or correct moral theory and then implement it in machines”.
My second set of quibbles
The following passage also seems strange to me:
Firstly, how does the empirical fact that, presently, “human beings hold a variety of reasonable but contrasting beliefs about value” strongly imply that, conditional on someone or some AI reliably identifying a true/correct moral theory, “there would still be no way of reliably communicating this truth to others”?
We’re probably talking about a very different world in this scenario where a true/correct moral theory has been reliably identified. I would assume we’d have had something like a “long reflection”, or major cognitive enhancement, or extremely advanced AI, or something like that. It seems totally plausible that what humans believe would be very different in that world.
Secondly, it does seem to me totally plausible, and perhaps likely, that even if someone had very good reason to believe they’d identified the true/correct moral theory, they still wouldn’t be able to “persuade other people of this truth using evidence and reason alone”. But that doesn’t seem certain. It doesn’t seem clear that “There would still be principled disagreement about how best to live”, or that “Designing AI in accordance with a single moral doctrine would therefore involve imposing a set of values and judgments on other people who did not agree with them” (emphasis added in both cases).
Again, I’d assume a world where identification of a true/correct moral theory has occurred is a very different world. It seems plausible that, in that world, people would be typically convinced of most true things just “using evidence and reason alone”.
And if that isn’t already the case by default, it seems plausible that some way around that obstacle could be developed, which we would not classify as a “form of domination” or as “imposing” values on people. For example, perhaps people’s intellectual capabilities could be raised to something approximating those of whatever entity had identified the true/correct moral theory. And/or perhaps people could be taken through something approximating the same lines of argument and evidence that had aided in the identification of that theory.
Thirdly, even if in practice this couldn't be done - if not all people could be convinced - it seems debatable whether those people's disagreement would be "principled disagreement". It might make sense to view that disagreement as largely the result of a lack of knowledge or faulty reasoning, and thus to view it as wise not to try to fully factor in that disagreement when deciding what values AI systems should be aligned with.
Again, I’m not saying that I believe in the exact opposite claims to the claims the paper makes in this passage, or that we should assume that things will be easier than the paper suggests (see also this post). It just seems to me that the paper:
The second passage
(This is a less important point which there’s a higher chance I’m wrong about, given that I don’t have much formal philosophical training.)
The paper also discusses:
And the author writes that, if this approach was taken:
This seems to me to fit a common pattern of people thinking you need egalitarianism or prioritarianism to arrive at a conclusion that you can really arrive at with just standard utilitarianism, given the purely empirical fact that there’s diminishing marginal utility to many resources.[1]
For example, if I know that the same amount of money is more valuable for the poor than for the rich, that not being a slave is far more valuable to a slave than having a slave is to a “free man”, etc., then standard utilitarianism would lead me to work to ensure particular focus on benefitting the least well off. I wouldn’t need to be an egalitarian or prioritarian to reach that conclusion.
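To make that concrete, here's a minimal numerical sketch (my own toy illustration, not from the paper), assuming a logarithmic utility function as a stand-in for diminishing marginal utility: transferring a fixed sum from a rich person to a poor person raises the plain, unweighted sum of utilities, with no egalitarian or prioritarian weighting needed.

```python
import math

def utility(wealth: float) -> float:
    """Toy model: log utility, i.e. diminishing marginal utility of money."""
    return math.log(wealth)

poor, rich = 1_000, 100_000   # hypothetical wealth levels
transfer = 500                # fixed sum moved from rich to poor

total_before = utility(poor) + utility(rich)
total_after = utility(poor + transfer) + utility(rich - transfer)

print(f"Total utility before transfer: {total_before:.3f}")  # ~18.421
print(f"Total utility after transfer:  {total_after:.3f}")   # ~18.821
# The unweighted sum of utilities goes up, so a plain utilitarian already
# favours directing resources toward the worse-off person.
```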
Indeed, that sort of logic has led a lot of roughly utilitarian EAs to focus primarily on helping the extremely poor, farm animals, wild animals, people suffering from mental health issues, etc. These EAs recognise that these groups are more disadvantaged, and thus that a given amount of resources can benefit them more than it can benefit the relatively well-off (generally speaking), and that alone is enough to indicate that one should perhaps focus on helping these groups.
So I think that, if I was purely self-interested and behind a veil of ignorance, I’d want society to be set up along roughly utilitarian lines, rather than specifically along prioritarian or egalitarian lines. I don’t think I’d want “the worst off” to be given extreme priority, beyond what utilitarianism would give them, because that’d make me lose out too much if I don’t end up in that position.
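Here's a similar toy sketch of that veil-of-ignorance point (again my own illustration, with made-up numbers and log utility): if giving the worst-off extreme priority shrinks the total pie, a self-interested agent with an equal chance of landing in either position prefers the roughly utilitarian allocation, even though the strictly prioritarian one has a higher floor.

```python
import math

def expected_log_utility(allocation: tuple[float, float]) -> float:
    """Expected utility behind the veil: equal chance of each position, log utility."""
    return sum(math.log(w) for w in allocation) / len(allocation)

# Hypothetical policies (made-up numbers):
utilitarian_ish = (20_000, 80_000)   # bigger pie, moderate inequality
strict_priority = (25_000, 30_000)   # higher floor, but a much smaller pie

print(f"Utilitarian-ish policy: {expected_log_utility(utilitarian_ish):.3f}")  # ~10.597
print(f"Strict-priority policy: {expected_log_utility(strict_priority):.3f}")  # ~10.218
# Behind the veil, the expected-utility maximiser picks the utilitarian-ish
# policy, even though the strict-priority policy treats the worst-off better.
```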
(From memory, Moral Tribes by Joshua Greene discusses this sort of general point very well.)
Disclaimers
Firstly, I should restate that I thought this was an interesting paper, and that overall I think that I agreed with a lot of it and that I’d recommend it to people interested in the topic. I’ve disproportionately focused on what I didn’t agree with about this paper, largely because I have little to add regarding the various points I did agree with.
Secondly, I should note that there's a pretty impressive set of people in the "Acknowledgements" section, including people who seem to me very intelligent and whose views seem worth paying attention to. This updates me a little towards thinking my critiques are somehow just mistaken. (E.g., Joshua Greene, who I mentioned as supporting/informing one of my quibbles, is listed there.)
Thirdly, I’m aware that my first two quibbles seem very related to various discussions on LessWrong and elsewhere about whether a sufficiently powerful AI would necessarily discover “the moral truth”, or whether the “the moral truth” is intrinsically convincing. And I think this is also related to debates about internalism vs externalism, though I don’t know much about that. I haven’t explicitly discussed those debates here because:
Commentary on commentary
I might try to make a habit of writing reviews/commentaries/whatever as I read articles (e.g., this one). (Not counting articles that started on or are already linked to on LessWrong or the EA Forum, as in those cases I can just write comments.) The aims of this would be to:
Prompt me to more explicitly think through my vague sense of “this is very clever” or “something’s not quite right here”
Bring interesting articles to other people’s attention
Maybe productively change the beliefs of others or myself (e.g., through comments pushing back against my commentary)
I guess I'll see over time how valuable that seems to be. I also think it might be cool for more people to do that sort of thing more often (I'm aware that some already do).
Although note that the paper doesn’t make that claim explicitly. And it does seem true that, if decision-makers started at various points other than utilitarianism, the “concerns” the paper notes would move them "in the direction of" egalitarian or prioritarian principles. ↩︎