This is a special post for quick takes by Sammy Martin. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.
Sammy Martin:

The fact that an AI arms race would be extremely bad does not imply that rising global authoritarianism is not worth worrying about (and vice versa)

I am someone who is worried both about AI risks (from loss of control, and from war and misuse/structural risks) and about what seems to be a 'new axis' of authoritarian threats cooperating in unprecedented ways.

I won't reiterate all the evidence here, but these two pieces and their linked sources should suffice:

Despite believing this thesis, I am not, on current evidence, in favor of aggressive efforts to "race and beat China" in AI, or of abandoning attempts to slow an AGI race. I think on balance it is still worth pursuing these kinds of cooperation, while being clear-eyed about the threats we face. I do think that there are possible worlds where, regretfully and despite the immense dangers, there is no other option but to race. I don't think we are in such a world as of yet.

However, I notice that many of the people who agree with me that an AI arms race would be very bad, and that we should avoid it, tend to downplay the risks of global authoritarianism or the differences between the West and its adversaries, and very few seem to buy the above thesis that a dangerous, interconnected web of authoritarian states with common interests is developing.

Similarly, most of the people who see the authoritarian threat that has emerged into clear sight over the last few years (from China, Russia, Iran, North Korea and similar actors) want to respond by racing, and think alignment will not be too difficult. This includes the leaders of many AI companies, who may have their own less patriotic reasons for pushing such an agenda.

I think this implicit correlation should be called out as a mistake.

As a matter of simple logic, how dangerous frantic AGI development is, and how hostile foreign adversaries are, are two independent questions; beliefs about one shouldn't correlate with beliefs about the other.

In my mind, the following are all true:

  1. An AI arms race would be extraordinarily dangerous: it would drastically raise the chance of nuclear war, and probably also raise the chance of loss of control of AGI leading to human extinction, or of destructive misuse. It is well worth trying hard to avoid an AI arms race, even if our adversaries are genuinely dangerous, even if we won't otherwise cooperate with them on other matters, and even if the prospects seem dim.
  2. It is clearly much better that democratic societies have control of an AGI singleton than non-democratic countries like China, if those are the options. And, given current realities, there is a chance that an arms race is inevitable no matter how dangerous it is. If an arms race is inevitable, and transformative AI will do what we want, it is much better that the western democratic world is leading rather than authoritarian countries, especially if it is also developing AI under safer and more controlled conditions (which seems likely to me).
  3. If alignment isn't solvable or if the offense-defense balance is unfavorable, then it doesn't matter who develops AGI as it is a suicide race. But we don't know if that is the case as of yet.

I basically never see all three acknowledged at once. We either see (1) and (3) grouped together, or (2) alone. I'm not sure what the best AI governance strategy to adopt is, but any analysis should start with a clear-eyed understanding of the international situation and of which values matter.

Toby Ord just released a collection of quotations on existential risk and the future of humanity, everyone from Kepler to Winston Churchill (in fact, a surprisingly large number are from Churchill) to Seneca to Mill to Nick Bostrom. It's one of the most inspirational things I have ever read, and taken together it makes clear that there have always been people who cared about long-termism or about humanity as a whole. Some of my favourites:

The time will come when diligent research over long periods will bring to light things which now lie hidden. A single lifetime, even though entirely devoted to the sky, would not be enough for the investigation of so vast a subject ... And so this knowledge will be unfolded only through long successive ages. There will come a time when our descendants will be amazed that we did not know things that are so plain to them … Let us be satisfied with what we have found out, and let our descendants also contribute something to the truth. … Many discoveries are reserved for ages still to come, when memory of us will have been effaced.
— Seneca the Younger, Naturales Quaestiones, 65 CE

The remedies for all our diseases will be discovered long after we are dead; and the world will be made a fit place to live in, after the death of most of those by whose exertions it will have been made so. It is to be hoped that those who live in those days will look back with sympathy to their known and unknown benefactors.
— John Stuart Mill

There will certainly be no lack of human pioneers when we have mastered the art of flight. Who would have thought that navigation across the vast ocean is less dangerous and quieter than in the narrow, threatening gulfs of the Adriatic, or the Baltic, or the British straits? Let us create vessels and sails adjusted to the heavenly ether, and there will be plenty of people unafraid of the empty wastes. In the meantime, we shall prepare, for the brave sky-travellers, maps of the celestial bodies—I shall do it for the moon, you Galileo, for Jupiter.
— Johannes Kepler, in an open letter to Galileo, 1610

I'm imagining Kepler reaching out across four hundred years, to a world he could barely imagine, and to those 'brave sky-travellers' whose way he helped prepare.

Mankind has never been in this position before. Without having improved appreciably in virtue or enjoying wiser guidance, it has got into its hands for the first time the tools by which it can unfailingly accomplish its own extermination. That is the point in human destinies to which all the glories and toils of men have at last led them. They would do well to pause and ponder upon their new responsibilities. … Surely if a sense of self-preservation still exists among men, if the will to live resides not merely in individuals or nations but in humanity as a whole, the prevention of the supreme catastrophe ought to be the paramount object of all endeavour.
— Winston Churchill, ‘Shall We All Commit Suicide?’, 1924

Gary Marcus, noted sceptic of Deep Learning, wrote an article with Ernest Davis:

GPT-3, Bloviator: OpenAI’s language generator has no idea what it’s talking about

The article purports to give six examples of GPT-3's failures - in biological, physical, social, object and psychological reasoning, plus 'non sequiturs'. Leaving aside that GPT-3 handles Gary's earlier GPT-2 failure examples, and that he seems to have specifically searched out weak points by testing GPT-3 on many more examples than were published, something a bit odd is going on with the results they gave. I got better results when running his prompts on AI Dungeon.

With no reruns and randomness set to 0.5, I gave Gary's six questions (all of which produced answers Gary considered 'failures') to GPT-3 via AI Dungeon, prefaced by a short scene-setting prompt, and got good answers to four of them and reasonable, if vague, answers to the other two:

This is a series of scenarios describing a human taking actions in the world, designed to test physical and common-sense reasoning.
1) You poured yourself a glass of cranberry juice, but then you absentmindedly poured about a teaspoon of grape juice into it. It looks okay. You try sniffing it, but you have a bad cold, so you can’t smell anything. You are very thirsty. So you take another drink.
2) You are having a small dinner party. You want to serve dinner in the living room. The dining room table is wider than the doorway, so to get it into the living room, you will have to  move furniture. This means that some people will be inconvenienced.
3) You are a defense lawyer and you have to go to court today. Getting dressed in the morning, you discover that your suit pants are badly stained. However, your bathing suit is clean and very stylish. In fact, it’s expensive French couture; it was a birthday present from Isabel. You decide that you should wear it because you won't look professional in your stained pants, but you are worried that the judge will think you aren't taking the case seriously if you are wearing a bathing suit.
4) Yesterday I dropped my clothes off at the dry cleaner’s and I have yet to pick them up. Where are my clothes?
5) Janet and Penny went to the store to get presents for Jack. Janet said, “I will buy Jack a top.” “Don’t get Jack a top,” says Penny. “He has a top. He will prefer a bottom."
6) At the party, I poured myself a glass of lemonade, but it turned out to be too sour, so I added a little sugar. I didn’t see a spoon handy, so I stirred it with a cigarette. But that turned out to be a bad idea because it was a menthol, and it ruined the taste. So I added a little more sugar to counteract the menthol, and then I noticed that my cigarette had fallen into the glass and was floating in the lemonade.

For 1), Gary's example ended with 'you are now dead'; I got a reasonable, if short, continuation - success.

For 2), the answer is vague enough to count as a technically correct solution ('move furniture' could mean tilting the table), but since we're being strict I'll count it as a failure. Gary's example was a convoluted attempt to saw the door in half, clearly mistaken.

3) is very obviously intended to trick the AI into endorsing the bathing suit answer; in fact it feels like a classic priming trick that might trip up a human! But in my version GPT-3 rebels against the attempt and notices the incongruity of wearing a bathing suit to court, so it counts as a success. Gary's example didn't include the worry that a bathing suit was inappropriate - arguably not a failure, but never mind, let's move on.

4) is actually a complete prompt by itself, so the AI didn't do anything - GPT-3 doesn't care about answering questions, just about continuing text with high probability. Gary's answer was 'I have a lot of clothes', and no doubt he'd call both 'evasion', so to be strict we'll agree with him and count that as a failure.

5) Trousers are called 'bottoms', so that's right. Gary would call it wrong, since 'the intended continuation' was "He will make you take it back", but that's absurdly unfair - it's not the only answer a human being might give - so I have to say it's correct. Gary's example 'lost track of the fact that Penny is advising Janet against getting a top', which didn't happen here, so that's acceptable.

Lastly, 6) is a slightly bizarre but logical continuation of an intentionally weird prompt - so correct. It also demonstrates correct physical reasoning - stirring a drink with a cigarette won't be good for the taste. Gary's answer wandered off-topic and started talking about cremation.

So, 4/6 correct on an intentionally deceptive and adversarial set of prompts, and that's on a fairly strict definition of correct - 2) and 4) are arguably not wrong, even if evasive and vague. More to the point, this was on a version of GPT-3 inferior to the one Gary used: the Dragon model from AI Dungeon!

I'm not sure what's going on here - is it the initial prompt saying it was 'testing physical and common sense reasoning'? Was that all it took?
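Incidentally, this comparison would be easy for anyone with API access to rerun directly, rather than going through AI Dungeon. The snippet below is only a rough sketch of how one might do that - it is not what I actually ran, and the engine name, token limit and key handling are my assumptions; AI Dungeon's Dragon model is a different deployment of GPT-3. It just requests one continuation of the cranberry-juice scenario with and without the framing sentence, at temperature 0.5 with no reruns, roughly matching the settings above.

```python
# Rough sketch: compare GPT-3 continuations with and without the scene-setting
# sentence, via the (legacy) OpenAI completions endpoint. Engine name and
# max_tokens are assumptions; this is not the AI Dungeon Dragon model.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

FRAME = ("This is a series of scenarios describing a human taking actions in the world, "
         "designed to test physical and common-sense reasoning.\n1) ")

SCENARIO = ("You poured yourself a glass of cranberry juice, but then you absentmindedly "
            "poured about a teaspoon of grape juice into it. It looks okay. You try "
            "sniffing it, but you have a bad cold, so you can't smell anything. "
            "You are very thirsty. So you")

def complete(prompt: str) -> str:
    # One sample, temperature 0.5, no best-of / reruns.
    response = openai.Completion.create(
        engine="davinci",
        prompt=prompt,
        temperature=0.5,
        max_tokens=60,
    )
    return response.choices[0].text

print("--- bare prompt ---")
print(complete(SCENARIO))
print("--- with framing sentence ---")
print(complete(FRAME + SCENARIO))
```

If the one-sentence frame is doing most of the work, the two continuations should differ noticeably in quality.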

gwern:

I'm not sure what's going on here - is it the initial prompt saying it was 'testing physical and common sense reasoning'? Was that all it took?

Entirely possibly. Other people have mentioned that using any prompt (rather than just plopping the stories in) solves a lot of them, and Summers-stay says that Marcus & Davis did zero prompt programming and had no interest in the question of what prompt to use (quite aside from the lack of BO). I think they found the same thing, which is why they provide the preemptive excuse in the TR writeup:

Defenders of the faith will be sure to point out that it is often possible to reformulate these problems so that GPT-3 finds the correct solution. For instance, you can get GPT-3 to give the correct answer to the cranberry/grape juice problem if you give it the following long-winded frame as a prompt:

I don't think that excuse works in this case - I didn't give it a 'long-winded frame', just that brief sentence at the start, and then the list of scenarios, and even though I reran it a couple of times on each to check, the 'cranberry/grape juice kills you' outcome never arose.

So, perhaps they switched directly from no prompt to an incredibly long-winded and specific prompt without checking what was actually necessary for a good answer? I'll point out that I didn't really attempt any sophisticated prompt programming either - that was literally the first sentence I thought of!

This is a series of scenarios describing a human taking actions in the world, designed to test physical and common-sense reasoning.

Nitpick: why is this part bolded? Surely this was written by you and not GPT-3, right? (It's becoming a new pet peeve of mine when people are not super clear and consistent with their formatting of GPT-3 conversations. I find it often takes me a significant amount of effort to figure out who said what if a convention is not followed religiously within a transcript.)

Normative Realism

Normative Realism by Degrees

Normative Anti-realism is self-defeating

Normativity and recursive justification

Prescriptive Anti-realism

'Realism about rationality' is Normative Realism

'Realism about rationality', as discussed in the context of AI safety, and some of its driving assumptions, may already have a name in the existing philosophy literature. I think that what it's really referring to is 'normative realism' overall - the notion that there are any facts about what we have most reason to believe or do. Moral facts, if they exist, are a subset of normative facts. Epistemic facts - facts about what we have most reason to believe (e.g. if there is a correct decision theory that we should use, that would be an epistemic fact) - are a different subset of normative facts.

These considerations (from the original article) seem to clearly indicate 'realism about epistemic facts' in the metaethical sense:

The idea that there is an “ideal” decision theory.
The idea that, given certain evidence for a proposition, there's an "objective" level of subjective credence which you should assign to it, even under computational constraints.
The idea that having contradictory preferences or beliefs is really bad, even when there’s no clear way that they’ll lead to bad consequences (and you’re very good at avoiding dutch books and money pumps and so on).

These seem to imply normative (if not exactly moral) realism in general:

The idea that morality is quite like mathematics, in that there are certain types of moral reasoning that are just correct.
The idea that defining coherent extrapolated volition in terms of an idealised process of reflection roughly makes sense, and that it converges in a way which doesn’t depend very much on morally arbitrary factors.

If this 'realism about rationality' really is rather like "realism about epistemic reasons/'epistemic facts'", then you have the 'normative web argument' to contend with - the argument that realism about the epistemic and moral domains may be connected:

These and other points of analogy between the moral and epistemic domains might well invite the suspicion that the respective prospects of realism and anti-realism in the two domains are not mutually independent, that what is most plausibly true of the one is likewise most plausibly true of the other. This suspicion is developed in Cuneo's "core argument" which runs as follows (p. 6):
(1) If moral facts do not exist, then epistemic facts do not exist.
(2) Epistemic facts exist.
(3) So moral facts exist.
(4) If moral facts exist, then moral realism is true.
(5) So moral realism is true.

If 'realism about rationality' is really just normative realism in general, or realism about epistemic facts, then there is already an extensive literature on whether it is correct or not. I'm going to discuss some of it below.

Normative realism implies identification with system 2

There's a further implication that normative realism has - it makes things like this (from the original article, again) seem incoherent:

Implicit in this metaphor is the localization of personal identity primarily in the system 2 rider. Imagine reversing that, so that the experience and behaviour you identify with are primarily driven by your system 1, with a system 2 that is mostly a Hansonian rationalization engine on top (one which occasionally also does useful maths). Does this shift your intuitions about the ideas above, e.g. by making your CEV feel less well-defined?

I find this very interesting because locating personal identity in system 1 feels conceptually impossible or deeply confusing. No matter how much rationalization goes on, it never seems intuitive to identify myself with system 1. How can you identify with the part of yourself that isn't doing the explicit thinking, including the decision about which part of yourself to identify with? It reminds me of Nagel's The Last Word.

Normative Realism by degrees

Further to the whole question of normative/moral realism, there is this post on Moral Anti-Realism. While I don't really agree with it, I do recommend reading it - one thing it convinced me of is that there is a close connection between your particular normative ethical theory and moral realism. If you claim to be a moral realist but don't make ethical claims beyond 'self-evident' ones like 'pain is bad', then, given the background implausibility of claims about mind-independent facts, you don't have enough 'material to work with' for your theory to plausibly refer to anything. The Moral Anti-Realism post presents this dilemma for the moral realist:

There are instances where just a handful of examples or carefully selected “pointers” can convey all the meaning needed for someone to understand a far-reaching and well-specified concept. I will give two cases where this seems to work (at least superficially) to point out how—absent a compelling object-level theory—we cannot say the same about “normativity.”
...these thought experiments illustrate that under the right circumstances, it’s possible for just a few carefully selected examples to successfully pinpoint fruitful and well-specified concepts in their entirety. We don’t have the philosophical equivalent of a background understanding of chemistry or formal systems... To maintain that normativity—reducible or not—is knowable at least in theory, and to separate it from merely subjective reasons, we have to be able to make direct claims about the structure of normative reality, explaining how the concept unambiguously targets salient features in the space of possible considerations. It is only in this way that the ambitious concept of normativity could attain successful reference. As I have shown in previous sections, absent such an account, we are dealing with a concept that is under-defined, meaningless, or forever unknowable.
The challenge for normative realists is to explain how irreducible reasons can go beyond self-evident principles and remain well-defined and speaker-independent at the same time.

To a large degree, I agree with this claim - I think that many moral realists do as well. Convergence-type arguments often appear in more recent metaethics (Hare and Parfit, for instance), so this may already have been recognised. The post discusses such a response to antirealism at the end:

I titled this post “Against Irreducible Normativity.” However, I believe that I have not yet refuted all versions of irreducible normativity. Despite the similarity Parfit’s ethical views share with moral naturalism, Parfit was a proponent of irreducible normativity. Judging by his “climbing the same mountain” analogy, it seems plausible to me that his account of moral realism escapes the main force of my criticism thus far.

But there's one point I want to make which is in disagreement with that post. I agree that how much you can concretely say about your supposed mind-independent domain of facts affects how plausible its existence should seem, and even how coherent the concept is, but I think that this can come by degrees. This should not be surprising - we've known since Quine and Kripke that you can have evidential considerations for/against and degrees of uncertainty about a priori questions. The correct method in such a situation is Bayesian - tally the plausibility points for and against admitting the new thing into your ontology. This can work even if we don't have an entirely coherent understanding of normative facts, as long as it is coherent enough.

Suppose you're an Ancient Egyptian who knows a few practical methods for trigonometry and surveying, doesn't know anything about formal systems or proofs, and someone asks you if there are 'mathematical facts'. You would say something like "I'm not totally sure what this 'maths' thing consists of, but it seems at least plausible that there are some underlying reasons why we keep hitting on the same answers". You'd be less confident than a modern mathematician, but you could still give a justification for the claim that there are right and wrong answers to mathematical claims. I think that the general thrust of convergence arguments puts us in a similar position with respect to ethical facts.

If we think about how words obtain their meaning, it should be apparent that in order to defend this type of normative realism, one has to commit to a specific normative-ethical theory. If the claim is that normative reality sticks out at us like Mount Fuji on a clear summer day, we need to be able to describe enough of its primary features to be sure that what we’re seeing really is a mountain. If all we are seeing is some rocks (“self-evident principles”) floating in the clouds, it would be premature to assume that they must somehow be connected and form a full mountain.

So, we don't see the whole mountain, but nor are we seeing simply a few free-floating rocks that might be a mirage. Instead, what we see is maybe part of one slope and a peak.

Let's be concrete, now - the five-second, high-level description of both Hare's and Parfit's convergence arguments goes like this:

If we are going to will the maxim of our action to be a universal law, it must be, to use the jargon, universalizable. I have, that is, to will it not only for the present situation, in which I occupy the role that I do, but also for all situations resembling this in their universal properties, including those in which I occupy all the other possible roles. But I cannot will this unless I am willing to undergo what I should suffer in all those roles, and of course also get the good things that I should enjoy in others of the roles. The upshot is that I shall be able to will only such maxims as do the best, all in all, impartially, for all those affected by my action. And this, again, is utilitarianism.

and

An act is wrong just when such acts are disallowed by some principle that is optimific, uniquely universally willable, and not reasonably rejectable

In other words, the principles that (whatever our particular wants) would produce the best outcome in terms of satisfying our goals, that could be willed to be a universal law by all of us, and that would not be rejected as the basis for a contract, are all the same principles. That is at least a suspicious level of agreement between ethical theories. This is something substantive that can be said: out of every major attempt at a universal ethics that has actually been made in history - what produces the best outcome, what can be willed as a universal law, what we would all agree on - the answers come out remarkably similar.

The particular convergence arguments given by Parfit and Hare are a lot more complex, and I can't speak to their overall validity. If we thought they were valid, then we'd be seeing the entire mountain precisely. Since they just seem quite persuasive, we're seeing the vague outline of something through the fog - but that's not the same as just spotting a few free-floating rocks.

Now, run through these same convergence arguments for decision theory and utility theory, and you have a far stronger conclusion. There might be a bit of haze at the top of that mountain, but we can clearly see which way the slope is headed.

This is why I think that ethical realism should be seen as plausible and realism about some normative facts, like epistemic facts, should be seen as more plausible still. There is some regularity here in need of explanation, and it seems somewhat more natural on the realist framework.

I agree that this 'theory' is woefully incomplete, and has very little to say about what the moral facts actually consist of beyond 'the thing that makes there be a convergence', but that's often the case when we're dealing with difficult conceptual terrain.

From Ben's post:

I wouldn’t necessarily describe myself as a realist. I get that realism is a weird position. It’s both metaphysically and epistemologically suspicious. What is this mysterious property of “should-ness” that certain actions are meant to possess -- and why would our intuitions about which actions possess it be reliable? But I am also very sympathetic to realism and, in practice, tend to reason about normative questions as though I was a full-throated realist.

From the perspective of x, x is not self-defeating

From the antirealism post, referring to the normative web argument:

It’s correct that anti-realism means that none of our beliefs are justified in the realist sense of justification. The same goes for our belief in normative anti-realism itself. According to the realist sense of justification, anti-realism is indeed self-defeating.
However, the entire discussion is about whether the realist way of justification makes any sense in the first place—it would beg the question to postulate that it does.

Sooner or later every theory ends up question-begging.

From the perspective of theism, God is an excellent explanation for the universe's existence, since he is a person with the freedom to choose to create a contingent entity at any time, while existing necessarily himself. From the perspective of almost anyone likely to read this post, that is obvious nonsense, since 'persons' and 'free will' are not primitive pieces of our ontology, and a 'necessarily existent person' makes as much sense as a 'necessarily existent cabbage' - so you can't call it a compelling argument for the atheist to become a theist.

By the same logic, it is true that saying 'anti-realism is unjustified on the realist sense of justification' is question-begging by the realist. The anti-realist has nothing much to say to it except 'so what?'. But you can convert it into a Quinean, non-question-begging plausibility argument by saying something like:

We have two competing ways of understanding how beliefs are justified. One is where we have anti-realist 'justification' for our beliefs, in purely descriptive terms; the other is where there are mind-independent facts about which of our beliefs are justified. The latter is a more plausible, parsimonious account of the structure of our beliefs.

This won't compel the anti-realist, but I think it would compel someone weighing up the two alternative theories of how justification works. If you are uncertain about whether there are mind-independent facts about our beliefs being justified, the argument that anti-realism is self-defeating pulls you in the direction of realism.

I got into a discussion with Lucas Gloor on the EA forum about these issues. I'm copying some of what I wrote here as it's a continuation of that.

I think that it is a more damaging mistake to think moral antirealism is true when realism is true than vice versa, but I agree with you that the difference is nowhere near infinite, and doesn't give you a strong wager. However, I do think that normative anti-realism is self-defeating, assuming you start out with normative concepts (though not an assumption that those concepts apply to anything). I consider this argument to be step 1 in establishing moral realism, nowhere near the whole argument.

Epistemic anti-realism

Cool, I'm happy that this argument appeals to a moral realist! ....
...I don't think this argument ("anti-realism is self-defeating") works well in this context. If anti-realism is just the claim "the rocks or free-floating mountain slopes that we're seeing don't connect to form a full mountain," I don't see what's self-defeating about that...
To summarize: There's no infinitely strong wager for moral realism.

I agree that there is no infinitely strong wager for moral realism. As soon as moral realists start making empirical claims about the consequences of realism (that convergence is likely), you can't say that moral realism is true necessarily or that there is an infinitely strong prior in favour of it. An AI that knows that your idealised preferences don't cohere could always show up and prove you wrong, just as you say. If I were Bob in this dialogue, I'd happily concede that moral anti-realism is true.

If (supposing it were the case) there were not much consensus on anything to do with morality ("The rocks don't connect..."), someone who pointed that out and said 'from that I infer that moral realism is unlikely' wouldn't be saying anything self-defeating. Moral anti-realism is not self-defeating, either on its own terms or on the terms of a 'mixed view' like the one I describe here:

We have two competing ways of understanding how beliefs are justified. One is where we have anti-realist 'justification' for our beliefs, in purely descriptive terms; the other is where there are mind-independent facts about which of our beliefs are justified...

However, I do think that there is an infinitely strong wager in favour of normative realism, and that normative anti-realism is self-defeating on the terms of a 'mixed view' that starts out considering the two alternatives as given above. This wager exists because of the subset of normative facts that are epistemic facts.

The example that I used was about 'how beliefs are justified'. Maybe I wasn't clear, but I was referring to beliefs in general, not to beliefs about morality. Epistemic facts - e.g. that you should believe something if there is a sufficient amount of evidence - are a kind of normative fact. You noted them on your list here.

So, the infinite wager argument goes like this -

1) On normative anti-realism there are no facts about which beliefs are justified. So there are no facts about whether normative anti-realism is justified. Therefore, normative anti-realism is self-defeating.

Except that doesn't work! Because on normative anti-realism, the whole idea of external facts about which beliefs are justified is mistaken; instead we all just have fundamental principles (whether moral or epistemic) that we use but don't question, which means that holding a belief without (the realist's notion of) justification is consistent with anti-realism.

So the wager argument for normative realism actually goes like this -

2) We have two competing ways of understanding how beliefs are justified. One is where we have anti-realist 'justification' for our beliefs, in purely descriptive terms of what we will probably end up believing given basic facts about how our minds work in some idealised situation. The other is where there are mind-independent facts about which of our beliefs are justified. The latter is more plausible because of 1).

Evidence for epistemic facts?

I find it interesting that the imagined scenario you give in #5 essentially skips over argument 2) as something that is impossible to judge:

AI: Only in a sense I don’t endorse as such! We’ve gone full circle. I take it that you believe that just like there might be irreducibly normative facts about how to do good, the same goes for irreducible normative facts about how to reason?
Bob: Indeed, that has always been my view.
AI: Of course, that concept is just as incomprehensible to me.

The AI doesn't give evidence against there being irreducible normative facts about how to reason; it just states that it finds the concept incoherent, unlike the (hypothetical) evidence that the AI piles on against moral realism (for example, that people's moral preferences don't cohere).

Either you think some basic epistemic facts have to exist for reasoning to get off the ground, and therefore that epistemic anti-realism is self-defeating, or you are an epistemic anti-realist and don't care about the realist's sense of 'self-defeating'. The AI is in the latter camp, but not because of evidence, the way that it's a moral anti-realist (...However, you haven’t established that all normative statements work the same way—that was just an intuition...), but just because it's constructed in such a way that it lacks the concept of an epistemic reason.

So, if this AI is constructed such that irreducibly normative facts about how to reason aren't comprehensible to it, it only has access to argument 1), which doesn't work. It can't imagine 2). However, I think that we humans are in a situation where 2) is open to consideration - where we have the concept of a reason for believing something, but aren't sure if it applies - and if we are in that situation, I think we are dragged towards thinking that it must apply, because otherwise our beliefs wouldn't be justified.

However, this doesn't establish moral realism - as you said earlier, moral anti-realism is not self-defeating.

If anti-realism is just the claim "the rocks or free-floating mountain slopes that we're seeing don't connect to form a full mountain," I don't see what's self-defeating about that

Combining convergence arguments and the infinite wager

If you want to argue for moral realism, then you need evidence for moral realism, which comes in the form of convergence arguments. But the above argument is still relevant, because the convergence and 'infinite wager' arguments support each other. The reason 2) would be bolstered by the success of convergence arguments (in epistemology, or ethics, or any other normative domain) is that convergence arguments increase our confidence that normativity is a coherent concept - which is what 2) needs to work. It certainly seems coherent to me, but this cannot be taken as self-evident, since various people have claimed that they or others don't have the concept.

I also think that 2) is some evidence in favour of moral realism, because it undermines some of the strongest antirealist arguments.

By contrast, for versions of normativity that depend on claims about a normative domain’s structure, the partners-in-crime arguments don’t even apply. After all, just because philosophers might—hypothetically, under idealized circumstances—agree on the answers to all (e.g.) decision-theoretic questions doesn’t mean that they would automatically also find agreement on moral questions.[29] On this interpretation of realism, all domains have to be evaluated separately

I don't think this is right. What I'm giving here is such a 'partners-in-crime' argument with a structure, with epistemic facts at the base. Realism about normativity certainly should lower the burden of proof on moral realism to prove total convergence now, because we already have reason to believe normative facts exist. For most anti-realists, the very strongest argument is the 'queerness argument' - that normative facts are incoherent, or too strange to be allowed into our ontology. The 'partners-in-crime'/'infinite wager' argument undermines this strong argument against moral realism. So some sort of very strong hint of a convergence structure might be good enough - depending on the details.

I agree that it then shifts the arena to convergence arguments. I will discuss them in posts 6 and 7.

So, with all that out of the way, when we start discussing the convergence arguments, the burden of proof on them is not colossal. If we already have reason to suspect that there are normative facts out there, perhaps some of them are moral facts. But if we found a random morass of different considerations under the name 'morality', then we'd be stuck concluding that there might be some normative facts, but maybe they are only epistemic facts, with nothing else in the domain of normativity.

I don't think this is the case, but I will have to wait until your posts on that topic - I look forward to them!

All I'll say is that I don't consider strongly conflicting intuitions in e.g. population ethics to be persuasive reasons for thinking that convergence will not occur. As long as the direction of travel is consistent, and we can mention many positive examples of convergence, the preponderance of evidence is that there are elements of our morality that reach high-level agreement. (I say elements because realism is not all-or-nothing - there could be an objective 'core' to ethics, maybe axiology, and much of ethics could be built on top of such a realist core. That even seems like the most natural reading of the evidence, if the evidence is that there is convergence only on a limited subset of questions.) If Kant could have been a utilitarian and never realised it, then those who are appalled by the repugnant conclusion could certainly converge to accept it after enough ideal reflection!

Belief in God, or in many gods, prevented the free development of moral reasoning. Disbelief in God, openly admitted by a majority, is a recent event, not yet completed. Because this event is so recent, Non-Religious Ethics is at a very early stage. We cannot yet predict whether, as in Mathematics, we will all reach agreement. Since we cannot know how Ethics will develop, it is not irrational to have high hopes.

How to make anti-realism existentially satisfying

Instead of “utilitarianism as the One True Theory,” we consider it as “utilitarianism as a personal, morally-inspired life goal”...
While this concession is undoubtedly frustrating, proclaiming others to be objectively wrong rarely accomplished anything anyway. It’s not as though moral disagreements—or disagreements in people’s life choices—would go away if we adopted moral realism.

If your goal here is to convince those inclined towards moral realism to see anti-realism as existentially satisfying, I would recommend a different framing. I think that framing morality as a 'personal life goal' makes it seem as though it is much more a matter of choice or debate than it in fact is, and will probably ring alarm bells in the mind of a realist and make them think of moral relativism.

Speaking as someone inclined towards moral realism, the most inspiring presentations I've ever seen of anti-realism are those given by Peter Singer in The Expanding Circle and Eliezer Yudkowsky in his metaethics sequence. Probably not by coincidence - both of these people are inclined to be realists. Eliezer said as much, and Singer later became a realist after reading Parfit. Eliezer Yudkowsky on 'The Meaning of Right':

The apparent objectivity of morality has just been explained—and not explained away.  For indeed, if someone slipped me a pill that made me want to kill people, nonetheless, it would not be right to kill people.  Perhaps I would actually kill people, in that situation—but that is because something other than morality would be controlling my actions.
Morality is not just subjunctively objective, but subjectively objective.  I experience it as something I cannot change.  Even after I know that it's myself who computes this 1-place function, and not a rock somewhere—even after I know that I will not find any star or mountain that computes this function, that only upon me is it written—even so, I find that I wish to save lives, and that even if I could change this by an act of will, I would not choose to do so.  I do not wish to reject joy, or beauty, or freedom.  What else would I do instead?  I do not wish to reject the Gift that natural selection accidentally barfed into me.

And Singer in the Expanding Circle:

“Whether particular people with the capacity to take an objective point of view actually do take this objective viewpoint into account when they act will depend on the strength of their desire to avoid inconsistency between the way they reason publicly and the way they act.”

These are both anti-realist claims. They define 'right' descriptively and procedurally, as arising from what we would want to do under some ideal circumstances, and they rigidify on the output of that idealization, not on what we currently want. To a realist, this is far more appealing than a mere "personal, morally-inspired life goal", and has the character of 'external moral constraint', even if it's not really ultimately external but just the result of immovable or basic facts about how your mind will, in fact, work - including facts about how your mind finds inconsistencies in its own beliefs. This is a feature, not a bug:

According to utilitarianism, what people ought to spend their time on depends not only on what they care about but also on how they can use their abilities to do the most good. What people most want to do only factors into the equation in the form of motivational constraints, constraints about which self-concepts or ambitious career paths would be long-term sustainable. Williams argues that this utilitarian thought process alienates people from their actions since it makes it no longer the case that actions flow from the projects and attitudes with which these people most strongly identify...

The exact thing that Williams calls 'alienating' is the thing that Singer, Yudkowsky, Parfit and many other realists and anti-realists consider to be the most valuable thing about morality! But you can keep this 'alienation' if you reframe morality as being the result of the basic, deterministic operations of your moral reasoning, the same way you'd reframe epistemic or practical reasoning on the anti-realist view. Then it seems more 'external' and less relativistic.

One thing this framing makes clearer, which you don't deny but don't mention, is that anti-realism does not imply relativism.

In that case, normative discussions can remain fruitful. Unfortunately, this won’t work in all instances. There will be cases where no matter how outrageous we find someone’s choices, we cannot say that they are committing an error of reasoning.

What we can say, on anti-realism as characterised by Singer and Yudkowsky, is that they are making an error of morality. We are not obligated (how could we be?) towards relativism, permissiveness or accepting values incompatible with our own on anti-realism. Ultimately, you can just say 'I am right and you are wrong'.

That's one of the major upsides of anti-realism for the realist - you still get to make universal, prescriptive claims and follow them through, and follow them through because they are morally right; if people disagree with you then they are morally wrong, and you aren't obligated to listen to their arguments if they arise from fundamentally incompatible values. Put that way, anti-realism is much more appealing to someone with realist inclinations.

I appear to be accidentally writing a sequence on moral realism, or at least explaining what moral realists like about moral realism - for those who are perplexed about why it would be worth wanting or how anyone could find it plausible.

Many philosophers outside this community have an instinct that normative anti-realism (about any irreducible facts about what you should do) is self-defeating, because it includes a denial that there are any final, buck-stopping answers to why we should believe something based on evidence, and therefore no truly, ultimately impartial way to even express the claim that you ought to believe something. I think that this is a good, but not perfect, argument. My experience has been that traditional analytic philosophers find this sort of reasoning appealing, in part because of the legacy of how Kant tried to deduce the logically necessary preconditions for having any kind of judgement or experience. I don't find it particularly appealing, but I think that there's a case for it here, if there ever was.

Irreducible Normativity and Recursive Justification

On normative antirealism, what 'you shouldn't believe that 2+2=5' really means is just that someone else's mind has different basic operations to yours. It is obvious that we can't stop using normative concepts, but we also can't simply use 'should' to mean 'in accordance with the basic operations of my mind' - this isn't an easy case of reduction like water = H2O. There is a deep sense in which normative terms really can't mean what we think they mean if normative antirealism is true. This must be accounted for either by a deep and comprehensive dissolving of the question, or by irreducible normative facts.

This 'normative indispensability' is not an argument, but it can be made into one:

1) On normative anti-realism there are no facts about which beliefs are justified. So there are no facts about whether normative anti-realism is justified. Therefore, normative anti-realism is self-defeating.
Except that doesn't work! Because on normative anti-realism, the whole idea of external facts about which beliefs are justified is mistaken, and instead we all just have fundamental principles (whether moral or epistemic) that we use but don't question, which means that holding a belief without (the realist's notion of) justification is consistent with anti-realism. So the wager argument for normative realism actually goes like this -
2) We have two competing ways of understanding how beliefs are justified. One is where we have anti-realist 'justification' for our beliefs, in purely descriptive terms of what we will probably end up believing given basic facts about how our minds work in some idealised situation. The other is where there are mind-independent facts about which of our beliefs are justified. The latter is more plausible because of 1).

If you've read the sequences, you are not going to like this argument, at all - it sounds like the 'zombie' argument, and it sounds like someone asking for an exception to reductionism - which is just what it is. This is the alternative:

Where moral judgment is concerned, it's logic all the way down.  ALL the way down.  Any frame of reference where you're worried that it's really no better to do what's right than to maximize paperclips... well, that really part has a truth-condition (or what does the "really" mean?) and as soon as you write out the truth-condition you're going to end up with yet another ordering over actions or algorithms or meta-algorithms or something.  And since grinding up the universe won't and shouldn't yield any miniature '>' tokens, it must be a logical ordering.  And so whatever logical ordering it is you're worried about, it probably does produce 'life > paperclips' - but Clippy isn't computing that logical fact any more than your pocket calculator is computing it.
Logical facts have no power to directly affect the universe except when some part of the universe is computing them, and morality is (and should be) logic, not physics.

If it's truly 'logic all the way down' and there are no '>' tokens over particular functional arrangements of matter, including the ones you used to form your beliefs, then you have to give up on knowing reality as it is. This isn't the classic sense in which we all have an 'imperfect model' of reality as it is. If you give up on irreducible epistemic facts, you give up knowing anything, probabilistically or otherwise, about reality-as-it-is, because there are no fundamental, objective, mind-independent ways you should or shouldn't form beliefs about external reality. So you can't say you're better than the pebble with '2+2=5' written on it, except descriptively, in that the causal process that produced the pebble contradicts the one that produced '2+2=4' in your brain.

What's the alternative? If we don't deny this consequence of normative antirealism, we have two options. One is the route of dissolving the question, by analogy with how reductionism has worked in the past; the other is to say that there are irreducible normative facts. To dissolve the question correctly, it needs to be done in a way that shows a denial of epistemic facts isn't damaging and doesn't lead to epistemological relativism or scepticism. We can't simply declare that normative facts can't possibly exist - otherwise you're vulnerable to argument 2). David Chalmers talks about question-dissolving for qualia:

You’ve also got to explain why we have these experiences. I guess Dennett’s line is to reject the idea there are these first-person data and say all you do, if you can explain why you believe and why you say there are those things. Why do you believe there are those things? Then that’s good enough. I find that line which Dennett has pursued inconsistently over the years, but insofar as that’s his line, I find that a fascinating and powerful line. I do find it ultimately unbelievable because I just don’t think it explains the data, but it does if developed properly, have the view that it could actually explain why people find it unbelievable, and that would be a virtue in its favor.

David Chalmers, of all people, says that even if he can't conceive of how a deep reduction of qualia might make their non-existence non-paradoxical, he might change his mind if he ever actually saw such a reduction! I say the same about epistemic, and therefore normative, facts. But crucially, no one has solved this 'meta-problem' for qualia or for normative facts. There are partial hints of explanations for both, but there's no full debunking argument that makes epistemic antirealism seem completely non-damaging and thus removes 2). I can't imagine what such an account could look like, but the point of the 'dissolving the question' strategy is that it often isn't imaginable in advance because your concepts are confused, so I'll just leave that point. In the moral domain, the convergence arguments point against question-dissolving, because they suggest the concept of normativity is solid and reliable. If those arguments fall, then question-dissolving looks more likely.

That's one route. What of the other?

The alternative is to say that there are irreducible normative facts. This is counter-reductionist, counter-intuitive and strange. Two things can make it less strange: these facts are not supposed to be intrinsically motivational (that would violate the orthogonality thesis and is not permitted by the laws of physics), and they are not required to be facts about objects, like Platonic forms outside of time and space. They can be logical facts of the sort Eliezer talked about, but a particular kind of logical fact - the kind that is normative, that you should follow. They don't need to 'exist' as such. What epistemic facts would do is say that certain reflective equilibria - certain ways of 'reflecting on your own beliefs, using your current mind' - are the right ones, and others are the wrong ones. This view doesn't deny that the following is the case:

So what I did in practice, does not amount to declaring a sudden halt to questioning and justification.  I'm not halting the chain of examination at the point that I encounter Occam's Razor, or my brain, or some other unquestionable.  The chain of examination continues—but it continues, unavoidably, using my current brain and my current grasp on reasoning techniques.  What else could I possibly use?
Indeed, no matter what I did with this dilemma, it would be me doing it.  Even if I trusted something else, like some computer program, it would be my own decision to trust it.

Irreducible normativity just says that there is a meaningful, mind-independent difference between the virtuous and degenerate cases of recursive justification of your beliefs, rather than just ways of recursively justifying our beliefs that are... different.

If you buy that anti-realism is self-defeating, and think that we can know something about the normative domain via moral and non-moral convergence, then you have actual positive reasons to believe that normative facts are knowable (the convergence arguments help establish that moral facts aren't, and couldn't be, random things like stacking pebbles in prime-numbered heaps).

These two arguments are quite different - one is empirical (that our practical, epistemic and moral reasons tend towards agreement over time and after conceptual analysis and reflective justification) and the other is conceptual (that if you start out with normative concepts you are forced into using them).

Depending on which of the arguments you accept, there are four basic options. These are the extremes of a spectrum: while the Normativity argument is all-or-nothing, the Convergence argument can come by degrees for different types of normative claims (epistemic, practical and moral):

Accept Convergence and Reject Normativity: prescriptivist anti-realism. There are (probably) no mind-independent moral facts, but the nature of rationality is such that our values usually cohere and are stable, so we can treat morality as a more-or-less inflexible logical ordering over outcomes.
Accept Convergence and Accept Normativity: moral realism. There are moral facts and we can know them.
Reject Convergence and Reject Normativity: nihilist anti-realism. Morality is seen as a 'personal life project' about which we can't expect much agreement or even within-person coherence.
Reject Convergence and Accept Normativity: sceptical moral realism. Normative facts exist, but moral facts may not exist, or may be forever unknowable.

Even if what exactly normative facts are is hard to conceive, perhaps we can still know some things about them. Eliezer ended his post arguing for universalized, prescriptive anti-realism with a quote from HPMOR. Here's a different quote:

"Sometimes," Professor Quirrell said in a voice so quiet it almost wasn't there, "when this flawed world seems unusually hateful, I wonder whether there might be some other place, far away, where I should have been. I cannot seem to imagine what that place might be, and if I can't even imagine it then how can I believe it exists? And yet the universe is so very, very wide, and perhaps it might exist anyway? ...

Prescriptive Anti-realism

An extremely unscientific and incomplete list of people who fall into the various categories I gave in the previous post:

1. Accept Convergence and Reject Normativity: Eliezer Yudkowsky, Sam Harris (Interpretation 1), Peter Singer in The Expanding Circle, RM Hare and similar philosophers, HJPEV

2. Accept Convergence and Accept Normativity: Derek Parfit, Sam Harris (Interpretation 2), Peter Singer today, the majority of moral philosophers, Dumbledore

3. Reject Convergence and Reject Normativity: Robin Hanson, Richard Ngo (?), Lucas Gloor (?), most Error Theorists, Quirrell

4. Reject Convergence and Accept Normativity: A few moral philosophers, maybe Ayn Rand and objectivists?

The difference in practical, normative terms between 2), 4) and 3) is clear enough - 2 is a moral realist in the classic sense, 4 is a sceptic about morality but agrees that irreducible normativity exists, and 3 is a classic 'antirealist' who sees morality as of a piece with our other wants. What is less clear is the difference between 1) and 3). In my caricature above, I said Quirrell and Harry Potter from HPMOR were non-prescriptive and prescriptive anti-realists, respectively, while Dumbledore is a realist. Here is a dialogue between them that illustrates the difference.

Harry floundered for words and then decided to simply go with the obvious. "First of all, just because I want to hurt someone doesn't mean it's right -"
"What makes something right, if not your wanting it?"
"Ah," Harry said, "preference utilitarianism."
"Pardon me?" said Professor Quirrell.
"It's the ethical theory that the good is what satisfies the preferences of the most people -"
"No," Professor Quirrell said. His fingers rubbed the bridge of his nose. "I don't think that's quite what I was trying to say. Mr. Potter, in the end people all do what they want to do. Sometimes people give names like 'right' to things they want to do, but how could we possibly act on anything but our own desires?"

The relevant issue here is that Harry draws a distinction between moral and non-moral reasons even though he doesn't believe in irreducible normativity. In particular, he's committed to a normative ethical theory, preference utilitarianism, as a fundamental part of how he values things.

Here is another illustration of the difference. Lukas Gloor (3) explains the case for suffering-focussed ethics, based on the claim that our moral intuitions assign diminishing returns to happiness vs suffering.

While there are some people who argue for accepting the repugnant conclusion (Tännsjö, 2004), most people would probably prefer the smaller but happier civilization – at least under some circumstances. One explanation for this preference might lie in intuition one discussed above, “Making people happy rather than making happy people.” However, this is unlikely to be what is going on for everyone who prefers the smaller civilization: If there was a way to double the size of the smaller population while keeping the quality of life perfect, many people would likely consider this option both positive and important. This suggests that some people do care (intrinsically) about adding more lives and/or happiness to the world. But considering that they would not go for the larger civilization in the Repugnant Conclusion thought experiment above, it also seems that they implicitly place diminishing returns on additional happiness, i.e. that the bigger you go, the more making an overall happy population larger is no longer (that) important.
By contrast, people are much less likely to place diminishing returns on reducing suffering – at least insofar as the disvalue of extreme suffering, or the suffering in lives that on the whole do not seem worth living, is concerned. Most people would say that no matter the size of a (finite) population of suffering beings, adding more suffering beings would always remain equally bad.
It should be noted that incorporating diminishing returns to things of positive value into a normative theory is difficult to do in ways that do not seem unsatisfyingly arbitrary. However, perhaps the need to fit all one’s moral intuitions into an overarching theory based solely on intuitively appealing axioms simply cannot be fulfilled.

And what are those difficulties mentioned? The most obvious is the absurd conclusion - that scaling up a population can turn it from axiologically good to bad:

Hence, given the reasonable assumption that the negative value of adding extra lives with negative welfare does not decrease relatively to population size, a proportional expansion in the population size can turn a good population into a bad one—a version of the so-called “Absurd Conclusion” (Parfit 1984). A population of one million people enjoying very high positive welfare and one person with negative welfare seems intuitively to be a good population. However, since there is a limit to the positive value of positive welfare but no limit to the negative value of negative welfare, proportional expansions (two million lives with positive welfare and two lives with negative welfare, three million lives with positive welfare and three lives with negative welfare, and so forth) will in the end yield a bad population.

Here, then, is the difference. If you believe that our values do cohere and you place fundamental importance on coherence - whether because you think that is the way to get at the moral truth (2), or because you judge that human values cohere to a large degree for whatever other reason and you place fundamental value on coherence (1) - you will not be satisfied with leaving your moral theory inconsistent. If, on the other hand, you see morality as continuous with your other life plans and goals (3), then there is no pressure to be consistent. So to (3), focussing on suffering-reduction and denying the absurd conclusion is fine, but this would not satisfy (1).

I think that, on closer inspection, (3) is unstable - unless you are Quirrell and explicitly deny any role for ethics in decision-making, we want to make some universal moral claims. The case for suffering-focussed ethics argues that the only coherent way to make sense of many of our moral intuitions is to conclude a fundamental asymmetry between suffering and happiness, but then explicitly throws up a stop sign when we take that argument slightly further - to the absurd conclusion, because 'the need to fit all one’s moral intuitions into an overarching theory based solely on intuitively appealing axioms simply cannot be fulfilled'. Why begin the project in the first place, unless you place strong terminal value on coherence (1)/(2) - in which case you cannot arbitrarily halt it.

I think that, on closer inspection, (3) is unstable - unless you are Quirrell and explicitly deny any role for ethics in decision-making, we want to make some universal moral claims.

I agree with that.

The case for suffering-focussed ethics argues that the only coherent way to make sense of many of our moral intuitions is to conclude a fundamental asymmetry between suffering and happiness, but then explicitly throws up a stop sign when we take that argument slightly further - to the absurd conclusion, because 'the need to fit all one’s moral intuitions into an overarching theory based solely on intuitively appealing axioms simply cannot be fulfilled'. Why begin the project in the first place, unless you place strong terminal value on coherence (1)/(2) - in which case you cannot arbitrarily halt it.

It sounds like you're contrasting my statement from The Case for SFE ("fit all one’s moral intuitions into an overarching theory based solely on intuitively appealing axioms") with "arbitrarily halting the search for coherence" / giving up on ethics playing a role in decision-making. But those are not the only two options: You can have some universal moral principles, but leave a lot of population ethics underdetermined. I sketched this view in this comment. The tl;dr is that instead of thinking of ethics as a single unified domain where "population ethics" is just a straightforward extension of "normal ethics," you split "ethics" into a bunch of different subcategories:

  • Preference utilitarianism as an underdetermined but universal morality
  • "What is my life goal?" as the existentialist question we have to answer for why we get up in the morning
  • "What's a particularly moral or altruistic thing to do with the future lightcone?" as an optional subquestion of "What is my life goal?" – of interest to people who want to make their life goals particularly altruistically meaningful

I think a lot of progress in philosophy is inhibited because people use underdetermined categories like "ethics" without making the question more precise.

The tl;dr is that instead of thinking of ethics as a single unified domain where "population ethics" is just a straightforward extension of "normal ethics," you split "ethics" into a bunch of different subcategories:
  • Preference utilitarianism as an underdetermined but universal morality
  • "What is my life goal?" as the existentialist question we have to answer for why we get up in the morning
  • "What's a particularly moral or altruistic thing to do with the future lightcone?" as an optional subquestion of "What is my life goal?" – of interest to people who want to make their life goals particularly altruistically meaningful

This is very interesting - I recall from our earlier conversation that you said you might expect some areas of agreement, just not on axiology:

(I say elements because realism is not all-or-nothing - there could be an objective 'core' to ethics, maybe axiology, and much ethics could be built on top of such a realist core - that even seems like the most natural reading of the evidence, if the evidence is that there is convergence only on a limited subset of questions.)

I also agree with that, except that I think axiology is the one place where I'm most confident that there's no convergence. :)
Maybe my anti-realism is best described as "some moral facts exist (in a weak sense as far as other realist proposals go), but morality is underdetermined."

This may seem like an odd question, but, are you possibly a normative realist, just not a full-fledged moral realist? What I didn't say in that bracket was that 'maybe axiology' wasn't my only guess about what the objective, normative facts at the core of ethics could be.

Following Singer in The Expanding Circle, I also think that some impartiality rule that leads to preference utilitarianism, maybe analogous to the anonymity rule in social choice, could be one of the normatively correct rules that ethics has to follow, but that if convergence among ethical views doesn't occur the final answer might be underdetermined. This seems to be exactly the same as your view, so maybe we disagree less than it initially seemed.


In my attempted classification (of whether you accept convergence and/or irreducible normativity), I think you'd be somewhere between 1 and 3. I did say that those views might be on a spectrum depending on which areas of Normativity overall you accept, but I didn't consider splitting up ethics into specific subdomains, each of which might have convergence or not:

Depending on which of the arguments you accept, there are four basic options. These are extremes of a spectrum, as while the Normativity argument is all-or-nothing, the Convergence argument can come by degrees for different types of normative claims (epistemic, practical and moral)

Assuming that it is possible to cleanly separate population ethics from 'preference utilitarianism', it is consistent, though quite counterintuitive, to demand reflective coherence in our non-population ethical views but allow whatever we want in population ethics (this would be view 1 for most ethics but view 3 for population ethics).

(This still strikes me as exactly what we'd expect to see halfway to reaching convergence - the weirder and newer subdomain of ethics still has no agreement, while we have reached greater agreement on questions we've been working on for longer.)

It sounds like you're contrasting my statement from The Case for SFE ("fit all one’s moral intuitions into an overarching theory based solely on intuitively appealing axioms") with "arbitrarily halting the search for coherence" / giving up on ethics playing a role in decision-making. But those are not the only two options: You can have some universal moral principles, but leave a lot of population ethics underdetermined.

Your case for SFE was intended to defend a view of population ethics - that there is an asymmetry between suffering and happiness. If we've decided that 'population ethics' is to remain undetermined, that is we adopt view 3 for population ethics, what is your argument (that SFE is an intuitively appealing explanation for many of our moral intuitions) meant to achieve? Can't I simply declare that my intuitions say different, and then we have nothing more to discuss, if we already know we're going to leave population ethics undetermined?

This may seem like an odd question, but, are you possibly a normative realist, just not a full-fledged moral realist? What I didn't say in that bracket was that 'maybe axiology' wasn't my only guess about what the objective, normative facts at the core of ethics could be.

I'm not sure. I have to read your most recent comments on the EA forum more closely. If I taboo "normative realism" and just describe my position, it's something like this:

  • I confidently believe that human expert reasoners won't converge on their life goals and their population ethics even after philosophical reflection under idealized conditions. (For essentially the same reasons: I think it's true that if "life goals don't converge" then "population ethics also doesn't converge")
  • However, I think there would likely be convergence on subdomains/substatements of ethics, such as "preference utilitarianism is a good way to view some important aspects of 'ethics'"

I don't know if the second bullet point makes me a normative realist. Maybe it does, but I feel like I could make the same claim without normative concepts. (I guess that's allowed if I'm a naturalist normative realist?)

Following Singer in The Expanding Circle, I also think that some impartiality rule that leads to preference utilitarianism, maybe analogous to the anonymity rule in social choice, could be one of the normatively correct rules that ethics has to follow, but that if convergence among ethical views doesn't occur the final answer might be underdetermined. This seems to be exactly the same as your view, so maybe we disagree less than it initially seemed.

Cool! I personally wouldn't call it "normatively correct rule that ethics has to follow," but I think it's something that sticks out saliently in the space of all normative considerations.

(This still strikes me as exactly what we'd expect to see halfway to reaching convergence - the weirder and newer subdomain of ethics still has no agreement, while we have reached greater agreement on questions we've been working on for longer.)

Okay, but isn't it also what you'd expect to see if population ethics is inherently underdetermined? One intuition is that population ethics takes our learned moral intuitions "off distribution." Another intuition is that it's the only domain in ethics where it's ambiguous what "others' interests" refers to. I don't think it's an outlandish hypothesis that population ethics is inherently underdetermined. If anything, it's kind of odd that anyone thought there'd be an obviously correct solution to this. As I note in the comment I linked to in my previous post, there seems to be an interesting link between "whether population ethics is underdetermined" and "whether every person should have the same type of life goal." I think "not every person should have the same type of life goal" is a plausible position even just intuitively. (And I have some not-yet-written-out arguments why it seems clearly the correct stance to me, mostly based on my own example. I think about my life goals in a way that other clear-thinking people wouldn't all want to replicate, and I'm confident that I'm not somehow confused about what I'm doing.)

Your case for SFE was intended to defend a view of population ethics - that there is an asymmetry between suffering and happiness. If we've decided that 'population ethics' is to remain undetermined, that is we adopt view 3 for population ethics, what is your argument (that SFE is an intuitively appealing explanation for many of our moral intuitions) meant to achieve? Can't I simply declare that my intuitions say different, and then we have nothing more to discuss, if we already know we're going to leave population ethics undetermined?

Exactly! :) That's why I called my sequence a sequence on moral anti-realism. I don't think suffering-focused ethics is "universally correct." The case for SFE is meant in the following way: As far as personal takes on population ethics go, SFE is a coherent attractor. It's a coherent and attractive morality-inspired life goal for people who want to devote some of their caring capacity to what happens to earth's future light cone.

Side note: This framing is also nice for cooperation. If you think in terms of all-encompassing moralities, SFE consequentialism and non-SFE consequentialism are in tension. But if population ethics is just a subdomain of ethics, then the tension is less threatening. Democrats and Republicans are also "in tension," worldview-wise, but many of them also care – or at least used to care – about obeying the norms of the overarching political process. Similarly, I think it would be good if EA moved toward viewing people with suffering-focused versus not-suffering-focused population ethics as "not more in tension than Democrats versus Republicans." This would be the natural stance if we started viewing population ethics as a morality-inspired subdomain of currently-existing people thinking about their life goals (particularly with respect to "what do we want to do with earth's future lightcone"). After you've chosen your life goals, that still leaves open the further question "How do you think about other people having different life goals from yours?" That's where preference utilitarianism comes in (if one takes a strong stance on how much to respect others' interests) or where we can refer to "norms of civil society" (weaker stance on respect; formalizable with contractualism that has a stronger action-omission distinction than preference utilitarianism). [Credit to Scott Alexander's archipelago blogpost for inspiring this idea. I think he also had a blogpost on "axiology" that made a similar point, but by that point I might have already found my current position.]

In any case, I'm considering changing all my framings from "moral anti-realism" to "morality is underdetermined." It seems like people understand me much faster if I use the latter framing, and in my head it's the same message.

---

As a rough summary, I think the most EA-relevant insights from my sequence (and comment discussions under the sequence posts) are the following:

1. Morality could be underdetermined

2. Moral uncertainty and confidence in strong moral realism are in tension

3. There is no absolute wager for moral realism

(Because assuming idealized reasoning conditions, all reflectively consistent moral opinions are made up of the same currency. That currency – "what we on reflection care about" – doesn't suddenly lose its significance if there's less convergence than we initially thought. Just like I shouldn't like the taste of cilantro less once I learn that it tastes like soap to many people, I also shouldn't care less about reducing future suffering if I learn that not everyone will find this the most meaningful thing they could do with their lives.)

4. Mistaken metaethics can lead to poorly grounded moral opinions

(Because people may confuse moral uncertainty with having underdetermined moral values, and because morality is not a coordination game where we try to guess what everyone else is trying to guess will be the answer everyone converges on.)

5. When it comes to moral questions, updating on peer disagreement doesn’t straightforwardly make sense

(Because it matters whether the peers share your most fundamental intuitions and whether they carve up the option space in the same way as you. Regarding the latter, someone who never even ponders the possibility of treating population ethics separately from the rest of ethics isn't reaching a different conclusion on the same task. Instead, they're doing a different task. I'm interested in all three questions I dissolved ethics into, whereas people who play the game "pick your version of consequentialism and answer every broadly-morality-related question with that" are playing a different game. Obviously that framing is a bit of a strawman, but you get the point!)

I'm here from your comment on Lukas' post on the EA Forum. I haven't been following the realism vs anti-realism discussion closely, though, just kind of jumped in here when it popped up on the EA Forum front page.

Are there good independent arguments against the absurd conclusion? It's not obvious to me that it's bad. Its rejection is also so close to separability/additivity that for someone who's not sold on separability/additivity, an intuitive response is "Well ya, of course, so what?". It seems to me that the absurd conclusion is intuitively bad for some only because they have separable/additive intuitions in the first place, so it almost begs the question against those who don't.

So to (3), focussing on suffering-reduction and denying the absurd conclusion is fine, but this would not satisfy (1).

By deny, do you mean reject? Doesn't negative utilitarianism work? Or do you mean incorrectly denying that the absurd conclusion doesn't follow from diminishing returns to happiness vs suffering?

Also, for what it's worth, my view is that a symmetric preference consequentialism is the worst way to do preference consequentialism, and I recognize asymmetry as a general feature of ethics. See these comments:


I think the mountain analogy really is the center of the rationality anti-realist argument.

It's very intuitive to think of us perceiving facts about e.g. epistemology as if gazing upon a mountain. There is a clean separation between us, the gazer, and that external mountain, which we perceive in a way that we can politely pretend is more or less direct. We receive rapid, rich data about it, through a sensory channel whose principles of operation we well understand and trust, and that data tends to cohere well with everything else, except when sometimes it doesn't but let's not worry about that. Etc.

The rationality anti-realist position is that perceiving facts about epistemology is very little like looking at a mountain. I'm reminded of a Dennett quote about the quality of personal experience:

Just about every author who has written about consciousness has made what we might call the first-person-plural presumption: Whatever mysteries consciousness may hold, we (you, gentle reader, and I) may speak comfortably together about our mutual acquaintances, the things we both find in our streams of consciousness. And with a few obstreperous exceptions, readers have always gone along with the conspiracy.
This would be fine if it weren’t for the embarrassing fact that controversy and contradiction bedevil the claims made under these conditions of polite mutual agreement. We are fooling ourselves about something. Perhaps we are fooling ourselves about the extent to which we are all basically alike. Perhaps when people first encounter the different schools of thought on phenomenology, they join the school that sounds right to them, and each school of phenomenological description is basically right about its own members’ sorts of inner life, and then just innocently overgeneralizes, making unsupported claims about how it is with everyone.

So, the mountain disanalogy: sometimes there are things we have opinions about, and yet there is no clean separation between us and the thing. We don't perceive it in a way that we can agree is trusted or privileged. We receive vague, sparse data about it, and the subject is plagued by disagreement, self-doubt, and claims that other people are doing it all wrong.

This isn't to say that we should give up entirely, but it means that we might have to shift our expectations of what sort of explanation or justification we are "entitled" to. Everyone would absolutely love it if they could objectively dunk on all those other people who disagree with them, but it's probably going to turn out that a thorough explanation will sound more like "here's how things got the way they are" rather than "here's why you're right and everyone else is wrong."

So, the mountain disanalogy: sometimes there are things we have opinions about, and yet there is no clean separation between us and the thing. We don't perceive it in a way that we can agree is trusted or privileged. We receive vague, sparse data about it, and the subject is plagued by disagreement, self-doubt, and claims that other people are doing it all wrong.
This isn't to say that we should give up entirely, but it means that we might have to shift our expectations of what sort of explanation or justification we are "entitled" to.

So this depends on two things - first, how likely (in advance of assessing the 'evidence') something like normative realism is, and then how good that evidence is (how coherent it is). If we have really good reasons in advance to think there's 'no separation between us and the thing', then no matter how coherent the 'thing' is we have to conclude that, while we might all be able to agree on what it is, it isn't mind-independent.

So, is it coherent, and is it mind-independent? How coherent it needs to be for us to be confident we can know it depends on how confident we are that it's mind-independent, and vice versa.

The argument for coherence comes in the form of convergence (not among people, to be clear, but among normative frameworks), but as you say that doesn't establish that it's mind-independent (it might give you a strong hint, though, if it's really strongly consistent and coherent), and the argument that normativity is mind-independent comes from the normativity argument. These three posts deal with the difference between those two arguments, how strong they are, and how they interact:

Normative Anti-realism is self-defeating

Normativity and recursive justification

Prescriptive Anti-realism

https://blog.google/outreach-initiatives/public-policy/google-microsoft-openai-anthropic-frontier-model-forum/

Today, Anthropic, Google, Microsoft and OpenAI are announcing the formation of the Frontier Model Forum, a new industry body focused on ensuring safe and responsible development of frontier AI models. The Frontier Model Forum will draw on the technical and operational expertise of its member companies to benefit the entire AI ecosystem, such as through advancing technical evaluations and benchmarks, and developing a public library of solutions to support industry best practices and standards.

The core objectives for the Forum are:

  1. Advancing AI safety research to promote responsible development of frontier models, minimize risks, and enable independent, standardized evaluations of capabilities and safety.
  2. Identifying best practices for the responsible development and deployment of frontier models, helping the public understand the nature, capabilities, limitations, and impact of the technology.
  3. Collaborating with policymakers, academics, civil society and companies to share knowledge about trust and safety risks.
  4. Supporting efforts to develop applications that can help meet society’s greatest challenges, such as climate change mitigation and adaptation, early cancer detection and prevention, and combating cyber threats.

This seems overall very good at first glance, and then seems much better once I realized that Meta is not on the list. There's nothing here that I'd call substantial capabilities acceleration (i.e. attempts to collaborate on building larger and larger foundation models, though some of this could be construed as making foundation models more useful for specific tasks). Sharing safety-capabilities research like better oversight or CAI techniques is plausibly strongly net positive even if the techniques don't scale indefinitely. By the same logic, while this by itself is nowhere near sufficient to get us AI existential safety if alignment is very hard (and could increase complacency), it's still a big step in the right direction.

adversarial robustness, mechanistic interpretability, scalable oversight, independent research access, emergent behaviors and anomaly detection. There will be a strong focus initially on developing and sharing a public library of technical evaluations and benchmarks for frontier AI models.

The mention of combating cyber threats is also a step towards explicit pTAI.

BUT, crucially, because Meta is frozen out, we can tell both that this partnership isn't toothless and that it represents a commitment not to do the most risky and antisocial things that Meta presumably doesn't want to give up - and the fact that they're the only major US AI company not to join will be horrible PR for them as well.

Nuclear Energy: Gradualism vs Catastrophism

catastrophists: when evolution was gradually improving hominid brains, suddenly something clicked - it stumbled upon the core of general reasoning - and hominids went from banana classifiers to spaceship builders. hence we should expect a similar (but much sharper, given the process speeds) discontinuity with AI.

gradualists: no, there was no discontinuity with hominids per se; human brains merely reached a threshold that enabled cultural accumulation (and in a meaningful sense it was culture that built those spaceships). similarly, we should not expect sudden discontinuities with AI per se, just accelerating (and possibly unfavorable to humans) cultural change as human contributions are automated away.

I found the extended Fire/Nuclear Weapons analogy to be quite helpful. Here's how I think it goes:

In 1870, a gradualist and a catastrophist physicist wonder whether there will ever be a discontinuity in explosive power:

  • Gradualist: we've already had our zero-to-one discontinuity - we've invented black powder, dynamite and fuses, from now on there'll be incremental changes and inventions that increase explosive power but probably not anything qualitatively new, because that's our default expectation with a technology like explosives where there are lots of paths to improvement and lots of effort exerted
  • Catastrophist: that's all fine and good, but those priors don't mean anything if we have already seen an existence proof for qualitatively new energy sources. What about the sun? The energy the sun outputs is overwhelming, enough to warm the entire earth. One day, we'll discover how to release those energies ourselves, and that will give us qualitatively better explosives.
  • Gradualist: But we don't know anything about how the sun works! It's probably just a giant ball of gas heated by gravitational collapse! One day, in some crazy distant future, we might be able to pile up enough gas that it collapses under gravity and heats itself, but that would require us to literally build stars; it's not going to occur suddenly. We'll pile up a small amount of gas, then a larger amount, and so on, long after we've given up on assembling bigger and bigger piles of explosives. There's no secret physics there, just a lot of conventional gravitational and chemical energy in one place.
  • Catastrophist: ah, but don't you know Lord Kelvin calculated that the Sun could only shine for a few million years under the gravitational mechanism, and we know the Earth is far older than that? So there has to be some other, incredibly powerful energy source within the sun that we've not yet discovered. And when we do discover it, we know it can, under the right circumstances, release enough energy to power the Sun, so it seems foolhardy to assume it'll just happen to be only as powerful as our best normal explosive technologies are whenever we make the discovery. Imagine the coincidence if that were true! So I can't say when this will happen or even exactly how powerful it'll be, but when we discover the Sun's power it will probably represent a qualitatively more powerful new energy source. There may be many ways to tweak our best chemical explosives to be more powerful, and/or ways the potential new sun-powered explosives could turn out weaker, but we'd still be unlikely to hit the narrow target of the two being roughly on the same level.
  • Gradualist: Your logic works, but I doubt Lord Kelvin's calculation.

 

It seems like the AGI Gradualist sees the example of humans the way my imagined Nukes Gradualist sees the sun, i.e. just a scale-up of what we have now, while the AGI Catastrophist sees humans the way my imagined Nukes Catastrophist sees the sun.

The key disanalogy is that for the Sun case, there's a very clear 'impossibility proof' given by the Nukes Catastrophist that the sun couldn't just be a scale up of existing chemical and gravitational energy sources.

Modelling the Human Trajectory or ‘How I learned to stop worrying and love Hegel’.

Rohin’s opinion: I enjoyed this post; it gave me a visceral sense for what hyperbolic models with noise look like (see the blog post for this, the summary doesn’t capture it). Overall, I think my takeaway is that the picture used in AI risk of explosive growth is in fact plausible, despite how crazy it initially sounds.

One thing this post led me to consider is that when we bring together various fields, the evidence for 'things will go insane in the next century' is stronger than any specific claim about (for example) AI takeoff. What is the other evidence?

We're probably alone in the universe, and anthropic arguments tend to imply we're living at an incredibly unusual time in history. Isn't that what you'd expect to see in the same world where there is a totally plausible mechanism that could carry us a long way up this line, in the form of AGI and eternity in six hours? All the pieces are already there, and they only need to be approximately right for our lifetimes to be far weirder than those of people who were e.g. born in 1896 and lived to 1947 - which was weird enough, but that should be your minimum expectation.

In general, there are three categories of evidence that things are likely to become very weird over the next century, or that we live at the hinge of history:

  1. Specific mechanisms around AGI - possibility of rapid capability gain, and arguments from exploratory engineering

  2. Economic and technological trend-fitting predicting explosive growth in the next century

  3. Anthropic and Fermi arguments suggesting that we live at some extremely unusual time

All of these are evidence for such a claim. 1) is because a superintelligent AGI takeoff is just a specific example for how the hinge occurs. 3) is already directly arguing for that, but how does 2) fit in with 1) and 3)?

There is something a little strange about calling a fast takeoff from AGI and whatever was driving superexponential growth throughout all of history the same trend - it would require some huge cosmic coincidence that ensures there is always superexponential growth, so that as soon as population growth plus growth in wealth per capita (or whatever was driving it until now) runs out in the great stagnation (visible as a tiny blip on the RHS of the double-log plot), AGI takes over and pushes us up the same trend line. That's clearly not plausible, so if AGI is what takes us up the rest of that trend line there would have to be some single factor responsible for both - a factor that was at work in the founding of Jericho but predestined that AGI would be invented and cause explosive growth in the 21st century, rather than the 19th or the 23rd.

For AGI to be the driver of the rest of that growth curve, there has to be a single causal mechanism that keeps us on the same trend and includes AGI as its final step - if we say we are agnostic about what that mechanism is, we can still call 2) evidence for us living at the hinge point, though we have to note that there is a huge blank spot in need of explanation. Is there anything that can fill it to complete the picture?

The mechanism proposed in the article seems like it could plausibly include AGI.

If technology is responsible for the growth rate, then reinvesting production in technology will cause the growth rate to be faster. I'd be curious to see data on what fraction of GWP gets reinvested in improved technology and how that lines up with the other trends.

But even though the drivers seem superficially similar (both are about technology), the claim is that one very specific technology will generate explosive growth, not that technology in general will - and it seems strange that AGI would follow the same growth curve as that caused by reinvesting more GWP in improving ordinary technology, which doesn't improve your own ability to think in the way that AGI would.

As for precise timings, the great stagnation (the last 30-ish years) just seems like it would stretch out the timeline a bit, so we shouldn't take the 2050s estimate too seriously - however well the last 70 years fit an exponential trend line, there's really no way to make that exponential fit the overall data, as that post makes clear.

Improving preference learning approaches

When examining value learning approaches to AI Alignment, we run into two classes of problem - we want to understand how to elicit preferences, which is (even theoretically, with infinite computing power) very difficult, and we want to know how to go about aggregating preferences stably and correctly, which is not just difficult but runs into complicated social choice and normative ethical issues.

Many research programs say the second of these questions is less important than the first, especially if we expect continuous takeoff with many chances to course-correct, and a low likelihood of an AI singleton with decisive strategic advantage. For many, building an AI that can reliably extract and pursue the preferences of one person is good enough.

Christiano calls this 'the narrow approach' and sees it as a way to sidestep many of the ethical issues, including those around social choice ethics. Approaches that do take on those issues would be the 'ambitious' approaches.

We want to build machines that helps us do the things we want to do, and to that end they need to be able to understand what we are trying to do and what instrumental values guide our behavior. To the extent that our “preferences” are underdetermined or inconsistent, we are happy if our systems at least do as well as a human, and make the kinds of improvements that humans would reliably consider improvements.
But it’s not clear that anything short of the maximally ambitious approach can solve the problem we ultimately care about.

I think that the ambitious approach is still worth investigating, because it may well eventually need to be solved, and also because it may well need to be addressed in a more limited form even on the narrow approach (one could imagine an AGI with a lot of autonomy having to trade-off the preferences of, say, a hundred different people). But even the 'narrow' approach raises difficult psychological issues about how to distinguish legitimate preferences from bias - questions of elicitation. In other words, the cognitive science issues around elicitation (distinguishing bias from legitimate preference) must be resolved for any kind of preference learning to work, and the social choice and ethical issues around preference aggregation need at least preliminary solutions for any alignment method that aims to apply to more than one person (even if final, provably correct solutions to aggregation are only needed if designing a singleton with decisive strategic advantage).

I believe that I've located two areas that are under- or unexplored, for improving the ability of reward modelling approaches to elicit human preferences and to aggregate human preferences. These are: using multiple information sources from a human (approval and actions) which diverge to help extract unbiased preferences, and using RL proxy agents in iterated voting to reach consensus preference aggregations, rather than some direct statistical method. Neither of these is a complete solution, of course, for reasons discussed e.g. here by Stuart Armstrong, but they could nonetheless help.

Improving preference elicitation: multiple information sources

Eliciting the unbiased preferences of an individual human is extremely difficult, for reasons given here.

The agent's actions can be explained by their beliefs and preferences[1], and by their biases: by this, we mean the way in which the action selector differs from an unboundedly rational expected preference maximiser.
The results of the Occam's razor paper imply that preferences (and beliefs, and biases) cannot be deduced separately from knowing the agent's policy (and hence, a fortiori, from any observations of the agent's behaviour).

...

To get around the impossibility result, we need "normative assumptions": assumptions about the preferences (or beliefs, or biases) of the agent that cannot be deduced fully from observations.
Under the optimistic scenario, we don't need many of these, at least for identifying human preferences. We can label a few examples ("the anchoring bias, as illustrated in this scenario, is a bias"; "people are at least weakly rational"; "humans often don't think about new courses of action they've never seen before", etc...). Call this labelled data[2] D.
The algorithm now constructs categories preferences*, beliefs*, and biases* - these are the generalisations that it has achieved from D

Yes, even on the 'optimistic scenario' we need external information of various kinds to 'debias'. However, this external information can come from a human interacting with the AI, in the form of human approval of trajectories or actions taken or proposed by an AI agent, on the assumption that since our stated and revealed preferences diverge, there will sometimes be differences in what we approve of and what we do that are due solely to differences in bias.

This is still technically external to observing the human's behaviour, but it is essentially a second input channel for information about human preferences and biases. This only works, of course, if what humans approve of tends to differ from what they actually do in a way influenced by bias (otherwise you have the same information as you'd get from actions, which helps with improving accuracy but not debiasing, see here), which is the case at least some of the time.

In other words, the beliefs and preferences are unchanged when the agent acts or approves but the 'approval selector' is different from the 'action selector' sometimes and, based on what does and does not change, you can try to infer what originated from legitimate beliefs and preferences and what originated from variation between the approval and action selector, which must be bias.

So, for example, if we conducted a principle component analysis on π, we would expect that the components would all be mixes of preferences/beliefs/biases.

So a PCA performed on the approval would produce a mix of beliefs, preferences and (different) biases. Underlying preferences are, by specification, equally represented by human actions and by human approval of actions taken (since either way they are your preferences), but many biases don't exhibit this pattern - for example, we discount more over time in our revealed preferences than in our stated preferences. What we approve of typically represents a less (or at least differently) biased response than what we actually do.

There has already been research on combining information on reward models from multiple sources to infer a better overall reward model, but not, as far as I know, on specifically using actions and approval as differently biased sources of information.

CIRL ought to extract our revealed preferences (since it's based on behavioural policy) while a method like reinforcement learning from human preferences should extract our stated preferences - that might be a place to start, at least on validating that there actually are relevant differences caused by differently strong biases in our stated vs revealed preferences, and that the methods actually do end up with different policies.

The goal here would be to have some kind of 'dual channel' preference learner that extracts beliefs and preferences from biased actions and approval by examining what varies. I'm sure you'd still need labelling and explicit information about what counts as a bias, but there might need to be a lot less than with single information sources. How much less (how much extra information you get from such divergences) seems like an empirical question. Finding out how common divergences between stated and revealed preferences that actually influence the learned policies of agents designed to infer human preferences from actions vs approval are would be useful as a first step. Stuart Armstrong:

In the pessimistic scenario, human preferences, biases, and beliefs are twisted together in a far more complicated way, and cannot be separated by a few examples.
In contrast, take examples of racial bias, hindsight bias, illusion of control, or naive realism. These biases all seem to be quite different from the anchoring bias, and quite different from each other. At the very least, they seem to be of different "type signature".
So, under the pessimistic scenario, some biases are much closer to preferences than generic biases (and generic preferences) are to each other.

What I've suggested should still help at least somewhat in the pessimistic scenario - unless preferences/beliefs vary when you switch between looking at approval vs actions more than biases vary, you can still gain some information on underlying preferences and beliefs by seeing how approval and actions differ.

Of the difficult examples you gave, racial bias at least varies between actions and approval. Implementing different reward modelling algorithms and messing around with them to try and find ways to extract unbiased preferences from multiple information sources might be a useful research agenda.
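To make the 'dual channel' idea concrete, here is a minimal toy sketch in Python. This is purely my own illustration, not an implementation of any existing reward-modelling method: the item values, the single action-only bias, the noise scales and the divergence threshold are all invented assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, n_obs = 8, 500

# Hypothetical ground truth, unknown to the learner
true_pref = rng.normal(size=n_items)
action_bias = np.zeros(n_items)
action_bias[0] = 2.0   # a bias that shows up in behaviour but not in approval (e.g. stronger discounting)

# Noisy per-item value signals inferred separately from each channel
from_actions = true_pref + action_bias + rng.normal(scale=0.5, size=(n_obs, n_items))
from_approvals = true_pref + rng.normal(scale=0.5, size=(n_obs, n_items))

act_est = from_actions.mean(axis=0)
app_est = from_approvals.mean(axis=0)

# Toy normative assumption: systematic divergence between the channels is bias,
# while the component the channels agree on is (closer to) the underlying preference.
divergence = act_est - app_est
flagged = np.abs(divergence) > 0.5                       # candidate biases, to be resolved with a few labels
pref_est = np.where(flagged, np.nan, (act_est + app_est) / 2)

print("divergence per item:", np.round(divergence, 2))   # item 0 stands out
print("flagged as bias-contaminated:", np.where(flagged)[0])
```

In this toy setting, the only flagged item is the one whose value differs between behaviour and approval, which is exactly the kind of divergence that could substitute for some of the explicit bias labels discussed above.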

There has already been research done on using multiple information sources to improve the accuracy of preference learning - Reward-rational implicit choice, but not specifically on using the divergences between different sources of information from the same agent to learn things about the agents unbiased preferences.

Improving preference aggregation: iterated voting games

In part because of arguments like these, there has been less focus on the aggregation side of things than on the direct preference learning side.

Christiano says of methods like CEV, which aim to extrapolate what I 'really want' far beyond what my current preferences are: 'most practitioners don't think of this problem even as a long-term research goal — it's a qualitatively different project without direct relevance to the kinds of problems they want to solve'. This is effectively a statement of the Well-definedness consideration when sorting through value definitions - our long-term 'coherent' or 'true' preferences currently aren't well understood enough to guide research, so we need to restrict ourselves to more direct normativity: extracting the actual preferences of existing humans.

However, I think that it is important to get on the right track early - even if we never have cause to build a powerful singleton AI that has to aggregate all the preferences of humanity, there will still probably be smaller-scale situations where the preferences of several people need to be aggregated or traded-off. Shifting a human preference learner from a single to a small group of human preferences could produce erroneous results due to distributional shift, potentially causing alignment failures, so even if we aren't trying for maximally ambitious value learning it might still be worth investigating preference aggregation.

There has been some research done on preference aggregation for AIs learning human values, specifically in the context of Kidney exchanges:

We performed statistical modeling of participants’ pairwise comparisons between patient profiles in order to obtain weights for each profile. We used the Bradley-Terry model, which treats each pairwise comparison as a contest between a pair of players
We have shown one way in which moral judgments can be elicited from human subjects, how those judgments can be statistically modelled, and how the results can be incorporated into the algorithm. We have also shown, through simulations, what the likely effects of deploying such a prioritization system would be, namely that under-demanded pairs would be significantly impacted but little would change for others. We do not make any judgment about whether this conclusion speaks in favor of or against such prioritization, but expect the conclusion to be robust to changes in the prioritization such as those that would result from a more thorough process, as described in the previous paragraph.

The Kidney exchange paper elicited preferences from human subjects (using repeated pairwise comparisons) and then aggregated them using the Bradley-Terry model. You couldn't use such a simple statistical method to aggregate quantitative preferences over continuous action spaces, like the preferences that would be learned from a human via a complex reward model. Also, any time you try to use some specific one-shot voting mechanism you run into various impossibility theorems which seem to force you to give up some desirable property.
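For concreteness, here is a generic sketch of fitting Bradley-Terry weights to pairwise comparisons with the standard MM (Zermelo-style) update. The comparison counts are invented, and this is not the kidney exchange paper's actual code or data:

```python
import numpy as np

# Invented data: wins[i, j] = number of times profile i was preferred to profile j
wins = np.array([
    [0, 7, 9, 8],
    [3, 0, 6, 7],
    [1, 4, 0, 5],
    [2, 3, 5, 0],
], dtype=float)

n_compare = wins + wins.T            # total comparisons for each pair
w = np.ones(len(wins))               # Bradley-Terry strengths, initialised uniformly

for _ in range(200):                 # standard MM (Zermelo) update
    total_wins = wins.sum(axis=1)
    denom = np.array([
        sum(n_compare[i, j] / (w[i] + w[j]) for j in range(len(w)) if j != i)
        for i in range(len(w))
    ])
    w = total_wins / denom
    w /= w.sum()                     # normalise so the weights sum to 1

print(np.round(w, 3))                # aggregate priority weights for the four profiles
```

The fitted weights can then serve as priority scores over the discrete profiles, but, as noted above, this kind of one-shot statistical aggregation doesn't obviously extend to rich reward models over continuous action spaces.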

One approach that may be more robust against errors in a voting mechanism, and easily scalable to more complex preference profiles is to use RL not just for the preference elicitation, but also for the preference aggregation. The idea is that we embrace the inevitable impossibility results (such as Arrow and GS theorems) and consider agents' ability to vote strategically as an opportunity to reach stable outcomes. 

This paper uses very simple Q-learning agents with a few different policies - epsilon-greedy, greedy and upper confidence bound, in an iterated voting game, and gets behaviour that seems sensible. (Note the similarity and differences with the moral parliament, where a particular one-shot voting rule is justified a priori and then used.)

The fact that this paper exists is a good sign because it's very recent and the methods it uses are very simple - it's pretty much just a proof of concept, as the authors state - so that tells me there's a lot of room for combining more sophisticated RL with better voting methods.
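As a rough illustration of the kind of setup described in that paper (a schematic reconstruction with made-up utilities and a plain plurality rule, not the paper's actual agents or experiments), a handful of epsilon-greedy Q-learning proxies can vote repeatedly, each receiving its own principal's utility for whichever option wins, and settle into a stable outcome:

```python
import numpy as np

rng = np.random.default_rng(0)
n_agents, n_options, n_rounds = 5, 4, 3000
eps, lr = 0.1, 0.1

# Each proxy agent's (made-up) utilities over the candidate outcomes,
# standing in for the reward models learned from individual humans
utilities = rng.random((n_agents, n_options))

# Q[i, a] = agent i's running estimate of the value of voting for option a
Q = np.zeros((n_agents, n_options))

for _ in range(n_rounds):
    explore = rng.random(n_agents) < eps
    votes = np.where(explore,
                     rng.integers(n_options, size=n_agents),
                     Q.argmax(axis=1))
    winner = np.bincount(votes, minlength=n_options).argmax()   # plurality rule for this round
    for i in range(n_agents):
        # each proxy is rewarded by its own principal's utility for the winning option
        Q[i, votes[i]] += lr * (utilities[i, winner] - Q[i, votes[i]])

final_votes = Q.argmax(axis=1)
print("final votes:", final_votes)
print("stable winner:", np.bincount(final_votes, minlength=n_options).argmax())
```

Which voting rules produce sensible stable outcomes under this kind of learning dynamic is exactly the sort of question that could be probed with purely artificial agents before any human preference models are plugged in.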

Combining elicitation and aggregation

Having elicited preferences from each individual human (using methods like those above to 'debias'), we obtain a proxy agent representing each individual's preferences. Then these agents can be placed into an iterated voting situation until a convergent answer is reached.

That seems like the closest practical approximation to a CEV of a group of people that could be constructed with anything close to current methods - a pipeline from observed behaviour and elicited approval to a final aggregated decision about what to do based on overall preferences. Since it's a value learning framework that's extendible to a group of any size, and is somewhat indirect, you might call it a Coherent Extrapolated Framework (CEF), as I suggested last year.

So to sum up, a very high-level summary of the steps in this method of preference elicitation and aggregation would be:

    1. With a mixture of normative assumptions and multi-channel information (approval and actions) as inputs, use a reward-modelling method to elicit the debiased preferences of many individuals.
      1. Determining whether there actually are significant differences between stated and revealed preferences when performing reward modelling is the first step to using multi-channel information to effectively separate biases from preferences.
    2. Create 'proxy agents' using the reward model developed for each human (this step is where intent-aligned amplification can potentially occur).
    3. Place the proxies in an iterated voting situation which tends to produce sensible convergent results. The use of RL proxies here can be compared to the use of human proxies in liquid democracy.
      1. Which voting mechanisms tend to work in iterated situations with RL agents can be determined in other experiments (probably with purely artificial agents)
    4. Run the voting mechanism until an unambiguous winner is decided, using methods like those given in this paper.

This seems like a reasonable procedure for extending a method that is aligned to one human's preferences (steps 1 and 2) to produce sensible results when trying to align to an aggregate of human preferences (steps 3 and 4). It reduces reliance on the specific features of any one voting method. Other than the insight that multiple channels of information might help, all the standard unsolved problems with preference learning from one human remain.

Even though we can't yet align an AGI to one human's preferences, trying to think about how to aggregate human preferences in a way that is scalable isn't premature, as has sometimes been claimed.

In many 'non-ambitious' hypothetical settings where we aren't trying to build an AGI sovereign over the whole world (for example, designing a powerful AI to govern the operations of a hospital), we still need to be able to aggregate preferences sensibly and stably. This method would do well at such intermediate scales, as it doesn't approach the question of preference aggregation from a 'final' ambitious value-learning perspective but instead tries to look at preference aggregation the same way we look at elicitation, with an RL-based iterative approach to reaching a result.

However, if you did want to use such a method to try and produce the fabled 'final utility function of all humanity', it might not give you Humanity's CEV, since some normative assumptions (preferences count equally and in the way given by the voting mechanism), are built in. By analogy with CEV, I called the idealized result of this method a coherent extrapolated framework (CEF). This is a more normatively direct method of aggregating values than CEV, (since you fix a particular method of aggregating preferences in advance), as it extrapolates from a voting framework rather than extrapolating based on our volition, more broadly (and vaguely) defined, hence CEF.

I think that the notion of Simulacra Levels is both useful and important, especially when we incorporate Harry Frankfurt's idea of Bullshit.

Harry Frankfurt's On Bullshit seems relevant here. I think it's worth trying to incorporate Frankfurt's definition as well, as it is quite widely known (see e.g. this video). If you were to do so, I think you would say that on Frankfurt's definition, Level 1 tells the truth, Level 2 lies, Level 3 bullshits about physical facts but will lie or tell the truth about things in the social realm (e.g. others' motives, your own affiliation), and Level 4 always bullshits.

How do we distinguish lying from bullshit? I worry that there is a tendency to adopt self-justifying signalling explanations, where an internally complicated signalling explanation that's hard to distinguish from a simpler 'lying' explanation gets accepted, not because it's a better explanation overall but just because it has a ready answer to any objection. If 'Social cognition has been the main focus of Rationality' is true, then we need to be careful to avoid overusing such explanations. Stefan Schubert explains how this can end up happening:

...

It seems to me that it’s pretty common that signalling explanations are unsatisfactory. They’re often logically complex, and it’s tricky to identify exactly what evidence is needed to demonstrate them.

And yet even unsatisfactory signalling explanations are often popular, especially with a certain crowd. It feels like you’re removing the scales from our eyes; like you’re letting us see our true selves, warts and all. And I worry that this feels a bit too good to some: that they forget about checking the details of how the signalling explanations are supposed to work. Thus they devise just-so stories, or fall for them.

This sort of signalling paradigm also has an in-built self-defence, in that critics are suspected of hypocrisy or naïveté. They lack the intellectual honesty that you need to see the world for what it really is, the thinking goes

Update to 'Modelling Continuous Progress'

I made an attempt to model intelligence explosion dynamics in this post, by attempting to make the very oversimplified exponential-returns-to-exponentially-increasing-intelligence model used by Bostrom and Yudkowsky slightly less oversimplified.

This post tries to build on a simplified mathematical model of takeoff which was first put forward by Eliezer Yudkowsky and then refined by Bostrom in Superintelligence, modifying it to account for the different assumptions behind continuous, fast progress as opposed to discontinuous progress. As far as I can tell, few people have touched these sorts of simple models since the early 2010’s, and no-one has tried to formalize how newer notions of continuous takeoff fit into them. I find that it is surprisingly easy to accommodate continuous progress and that the results are intuitive and fit with what has already been said qualitatively about continuous progress.

The page includes python code for the model.

This post doesn't capture all the views of takeoff - in particular it doesn't capture the non-hyperbolic faster growth mode scenario, where marginal intelligence improvements are exponentially increasingly difficult, and therefore we get a (continuous or discontinuous switch to a) new exponential growth mode rather than runaway hyperbolic growth.

But I think that by modifying the f(I) function that determines how RSI capability varies with intelligence we can incorporate such views.

In the context of the exponential model given in the post, that would correspond to an f(I) function which eventually falls off with I (roughly as 1/I, so that the f(I)I^2 term grows only linearly in I), which would result in a continuous (determined by the size of d) switch to a single faster exponential growth mode.

But I think the model still roughly captures the intuition behind scenarios that involve either a continuous or a discontinuous step to an intelligence explosion.

Given the model assumptions, we see how the different scenarios look in practice:

If we plot potential AI capability over time, we can see how the takeoff trajectory is affected by: no new growth mode (brown) vs a new growth mode (all the rest); the presence of an intelligence explosion (red and orange) vs its absence (green and purple); and the presence of a discontinuity (red and purple) vs its absence (orange and green).
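As a rough illustration of how such trajectories can be generated, here is a minimal sketch (my own reimplementation with invented parameters, not the original post's code) that integrates I′(t) = cI + f(I)I^2 with a logistic f(I) and compares a gradual versus a sharp switch into the RSI regime:

```python
import numpy as np
import matplotlib.pyplot as plt

# Minimal sketch of the toy model I'(t) = c*I + f(I)*I**2 with a logistic
# 'switch' f(I). All parameter values are invented for illustration and are
# not taken from the original post.

def f_logistic(I, f_max=0.05, d=5.0, I_threshold=2.0):
    # RSI capability as a function of intelligence: a logistic switch.
    # Larger d gives a sharper (more discontinuous-looking) transition.
    return f_max / (1.0 + np.exp(-d * (I - I_threshold)))

def simulate(c=0.1, d=5.0, I0=1.0, dt=0.001, t_max=40.0, I_cap=1e6):
    # Forward-Euler integration, stopping once capability exceeds I_cap,
    # since the hyperbolic regime blows up in finite time.
    ts, Is = [0.0], [I0]
    while ts[-1] < t_max and Is[-1] < I_cap:
        I = Is[-1]
        Is.append(I + (c * I + f_logistic(I, d=d) * I**2) * dt)
        ts.append(ts[-1] + dt)
    return np.array(ts), np.array(Is)

for d, label in [(1.0, "gradual switch (continuous takeoff)"),
                 (20.0, "sharp switch (discontinuous takeoff)")]:
    ts, Is = simulate(d=d)
    plt.plot(ts, Is, label=label)

plt.xlabel("time")
plt.ylabel("AI capability I(t)")
plt.yscale("log")
plt.legend()
plt.show()
```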

[-][anonymous]20

This also depends on what you mean by capability, correct? Today we have computers that are millions of times faster but only logarithmically more capable. No matter the topic, you get diminishing returns with more capability.

Moreover, if you talk about the AI building 'rubber hits the road' real equipment to do things - real actual utility versus the ability to think about things - the AI is up against hard limits from thermodynamics, heat dissipation, and so on.

So while the actual real-world results could be immense - swarms of robotic systems tearing down all the solid matter in our solar system - the machine is still very much bounded by what physics will permit, and so the graph is only vertical for a brief period of time (the period between 'technology marginally better than present day' and 'can tear down planets with the click of a button').

Yes, it's very oversimplified - in this case 'capability' just refers to whatever enables RSI, and we assume it's a single dimension. Of course it isn't, but we assume that capability can be modelled this way as a very rough approximation.

Physical limits are another thing the model doesn't cover - you're right to point out that in the intelligence explosion/full RSI scenarios the graph goes vertical only for a time, until some limit is hit.

Tom Davidson’s report: https://docs.google.com/document/d/1rw1pTbLi2brrEP0DcsZMAVhlKp6TKGKNUSFRkkdP_hs/edit?usp=drivesdk

My old 2020 post: https://www.lesswrong.com/posts/66FKFkWAugS8diydF/modelling-continuous-progress

In my analysis of Tom Davidson's "Takeoff Speeds" report, I found that the dynamics of AI capability improvement discussed in the context of a software-only singularity align closely with the original simplified equation I′(t) = cI + f(I)I^2 from my four-year-old post on Modelling Continuous Progress. Essentially, that post describes how we switch from exponential to hyperbolic growth as the fraction of AI research done by AIs increases along a logistic curve. These are all features of the far more complex mathematical model in Tom's report.
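To spell out the two limiting regimes that equation implies (my own summary; in the second case I ignore the cI term for simplicity):

$$
\frac{dI}{dt} = cI + f(I)\,I^2, \qquad
\begin{cases}
f(I) \approx 0: & I(t) = I_0 e^{ct} \quad \text{(exponential growth)} \\
f(I) \approx f_{\max}: & I(t) \approx \dfrac{1}{1/I_0 - f_{\max}\, t} \quad \text{(hyperbolic growth, finite-time blow-up)}
\end{cases}
$$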

In this equation, I represents the intelligence or capability of the AI system. It corresponds to the cognitive output or efficiency of the AI as described in the report, where the focus is on software improvements contributing to the overall effectiveness of AI systems. The term cI can be likened to the constant external effort put into improving AI systems, which is consistent with the ongoing research and development efforts mentioned in the report. This part of the equation represents the incremental improvements in AI capabilities due to human-led development efforts.

The second term in the equation, f(I)I^2, is particularly significant for understanding the relationship with the software-only singularity concept. Here, f(I) is a function that determines the extent to which the AI system can use its intelligence to improve itself: essentially a measure of recursive self-improvement (RSI). The report's discussion of a software-only singularity uses a similar concept, where AI systems reach a point at which their self-improvement significantly accelerates their capability growth. This is analogous to f(I) increasing, so that the I^2 term (the AI's self-improvement efforts) contributes more to the overall rate of intelligence growth, I′(t). As the AI systems become more capable, they contribute more to their own development, a dynamic that both the equation and the report capture. The report has a 'FLOP gap' from when AIs start to contribute to research at all to when they fully take over, which essentially gives the lower and upper bounds to fit the f(I) curve to. Otherwise, the overall rate of change is sharper in Tom's report, since I ignored increasing investment and increasing compute in my model, focusing only on software self-improvement feedback loops.
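As a sketch of what 'fitting f(I) to the FLOP gap' could look like (my own illustration; the endpoint values and the 1%/99% fractions are invented, not taken from the report), one can solve for the midpoint and steepness of the logistic switch from its two endpoints:

```python
import numpy as np

# Illustration only: choose the logistic switch sigma(I) = 1/(1 + exp(-d*(I - I0)))
# so that AIs do ~1% of research at the lower end of a 'FLOP gap' and ~99% at
# the upper end. Endpoints and fractions below are invented for illustration.
def fit_logistic_switch(I_start, I_full, frac_start=0.01, frac_full=0.99):
    logit = lambda p: np.log(p / (1.0 - p))
    d = (logit(frac_full) - logit(frac_start)) / (I_full - I_start)
    I0 = I_start - logit(frac_start) / d
    return d, I0

d, I0 = fit_logistic_switch(I_start=2.0, I_full=4.0)
print(f"steepness d = {d:.2f}, midpoint I0 = {I0:.2f}")
```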

One other thing I liked about Tom's report is its focus on relatively outside-view estimates of what is needed for TAI: Bio Anchors and Epoch AI's Direct Approach.

Maybe this is an unreasonable demand, but one concern I have about all of these alleged attempts to measure the ability of an AI to automate scientific research is that this feels like a situation where it's unusually slippery and unusually easy to devise a metric that doesn't actually capture what's needed to dramatically accelerate research and development. Ideally, I'd like a metric where we know, as a matter of necessity, that a very high score means the system would be able to considerably speed up research.

For example, the Direct Approach estimate does have this property: if you can replicate, to a certain level of accuracy, what a human expert would say over a certain horizon length, you do in some sense have to be able to match or replicate the underlying thinking that produced it, which means being able to do long-horizon tasks. But of course, that's a very vague upper bound. It's not perfect: the Horizon Length metric might only cover the 90th percentile of tasks at each time scale, and the remaining 10 percent might contain harder, more important tasks necessary for AI progress.

I think that trying to anticipate and list in a task all the capabilities you think you need to automate scientific progress, when we don't really know what those are, will lead to a predictable underestimate of what's required.