Less Wrong is a community blog devoted to refining the art of human rationality.
I came across a 2015 blog post by Vitalik Buterin that contains some ideas similar to Paul Christiano's recent Crowdsourcing moderation without sacrificing quality. The basic idea in both is that it would be nice to have a panel of trusted moderators carefully pore over every comment and decide on its quality, but since that is too expensive, we can instead use some tools to predict moderator decisions, and have the trusted moderators look at only a small subset of comments in order to calibrate the prediction tools. In Paul's proposal the prediction tool is machine learning (mainly using individual votes as features), and in Vitalik's proposal it's prediction markets where people bet on what the moderators would decide if they were to review each comment.
It seems worth thinking about how to combine the two proposals to get the best of both worlds. One fairly obvious idea is to let people both vote on comments as an expression of their own opinions and place bets about moderator decisions, with ML setting the baseline odds, which would reduce how much the forum would have to pay out to incentivize accurate prediction markets. The hoped-for outcome is that the ML algorithm would make correct decisions most of the time, but people could bet against it when they see it making mistakes, and moderators would review the comments with the greatest disagreement, either between ML and people or among bettors in general. Another part of Vitalik's proposal is that each commenter has to make an initial bet that moderators would decide that their comment is good. The article notes that such a bet can also be viewed as a refundable deposit. Such forced bets / refundable deposits would help solve a security problem with Paul's ML-based proposal.
Are there better ways to combine these prediction tools to help with forum moderation? Are there other prediction tools that can be used instead of or in addition to these?
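As a rough illustration of the combined scheme described above, here is a minimal Python sketch of how a forum might pick which comments to send to moderators. Everything in it is a hypothetical assumption rather than part of either proposal: each comment carries an ML-predicted probability `ml_prob` that moderators would rate it good, plus the probabilities implied by individual bets, and the disagreement score (gap between the ML baseline and the average bettor, plus the spread among bettors) is just one simple choice.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev


@dataclass
class Comment:
    """Hypothetical comment record; field names are illustrative assumptions."""
    comment_id: str
    ml_prob: float  # ML baseline: predicted P(moderators rate the comment good)
    bet_probs: list[float] = field(default_factory=list)  # probabilities implied by individual bets


def disagreement(c: Comment) -> float:
    """Score how contested a comment is.

    Adds (a) the gap between the ML baseline and the average bettor to
    (b) the spread among the bettors themselves. With no bets, the ML
    baseline stands unchallenged and the score is 0.
    """
    if not c.bet_probs:
        return 0.0
    ml_vs_people = abs(c.ml_prob - mean(c.bet_probs))
    among_bettors = pstdev(c.bet_probs)  # 0.0 when there is only one bet
    return ml_vs_people + among_bettors


def select_for_review(comments: list[Comment], budget: int) -> list[str]:
    """Return the ids of the `budget` most contested comments for moderators."""
    ranked = sorted(comments, key=disagreement, reverse=True)
    return [c.comment_id for c in ranked[:budget]]


comments = [
    Comment("a", ml_prob=0.9, bet_probs=[0.88, 0.92]),  # ML and bettors agree
    Comment("b", ml_prob=0.8, bet_probs=[0.2, 0.3]),    # bettors think ML is wrong
    Comment("c", ml_prob=0.5),                          # no bets placed
]
print(select_for_review(comments, budget=1))  # ['b']
```

This sketch only covers the triage step; it does not model payouts, the ML model itself, or the commenter's initial bet / refundable deposit.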
Some of you may already have seen this story, since it's several days old, but MIT Technology Review seems to have the best explanation of what happened: Why and How Baidu Cheated an Artificial Intelligence Test
Such is the success of deep learning on this particular test that even a small advantage could make a difference. Baidu had reported it achieved an error rate of only 4.58 percent, beating the previous best of 4.82 percent, reported by Google in March. In fact, some experts have noted that the small margins of victory in the race to get better on this particular test make it increasingly meaningless. That Baidu and others continue to trumpet their results all the same - and may even be willing to break the rules - suggest that being the best at machine learning matters to them very much indeed.
(In case you didn't know, Baidu is the largest search engine in China, with a market cap of $72B, compared to Google's $370B.)
The problem I see here is that the mainstream AI / machine learning community measures progress mainly by this kind of contest. Researchers are incentivized to use whatever method they can find or invent to gain a few tenths of a percent in some contest, which allows them to claim progress at an AI task and publish a paper. Even as the AI safety / control / Friendliness field gets more attention and funding, it seems easy to foresee a future where mainstream AI researchers continue to ignore such work because it does not contribute to the tenths of a percent that they are seeking but instead can only hinder their efforts. What can be done to change this?
In the not-too-distant past, people thought that our universe might be capable of supporting an unlimited amount of computation. Today our best guess at the cosmology of our universe is that it stops being able to support any kind of life or deliberate computation after a finite amount of time, during which only a finite amount of computation can be done (on the order of 10^120 operations).
Consider two hypothetical people, Tom, a total utilitarian with a near zero discount rate, and Eve, an egoist with a relatively high discount rate, a few years ago when they thought there was .5 probability the universe could support doing at least 3^^^3 ops and .5 probability the universe could only support 10^120 ops. (These numbers are obviously made up for convenience and illustration.) It would have been mutually beneficial for these two people to make a deal: if it turns out that the universe can only support 10^120 ops, then Tom will give everything he owns to Eve, which happens to be $1 million, but if it turns out the universe can support 3^^^3 ops, then Eve will give $100,000 to Tom. (This may seem like a lopsided deal, but Tom is happy to take it since the potential utility of a universe that can do 3^^^3 ops is so great for him that he really wants any additional resources he can get in order to help increase the probability of a positive Singularity in that universe.)
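Using the post's made-up numbers, the claim that this deal benefits both sides can be checked with a short expected-value calculation. Two simplifying assumptions are added here that are not in the original: Eve's utility is treated as linear in dollars, and k is a hypothetical factor for how much more Tom values a marginal dollar in the 3^^^3-ops universe than in the 10^120-ops universe.

```latex
\begin{aligned}
\mathbb{E}[\Delta\,\text{Eve}] &= 0.5 \times \$1{,}000{,}000 - 0.5 \times \$100{,}000 = \$450{,}000 > 0, \\
\mathbb{E}[\Delta U_{\text{Tom}}] &\propto 0.5 \times k \times 100{,}000 - 0.5 \times 1 \times 1{,}000{,}000 > 0 \iff k > 10.
\end{aligned}
```

Since the premise is that k is enormous for Tom, the condition k > 10 is easily met, so both parties expect to gain.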
You and I are not total utilitarians or egoists, but instead are people with moral uncertainty. Nick Bostrom and Toby Ord proposed the Parliamentary Model for dealing with moral uncertainty, which works as follows:
Suppose that you have a set of mutually exclusive moral theories, and that you assign each of these some probability. Now imagine that each of these theories gets to send some number of delegates to The Parliament. The number of delegates each theory gets to send is proportional to the probability of the theory. Then the delegates bargain with one another for support on various issues; and the Parliament reaches a decision by the delegates voting. What you should do is act according to the decisions of this imaginary Parliament.
It occurred to me recently that in such a Parliament, the delegates would make deals similar to the one between Tom and Eve above, where they would trade their votes/support in one kind of universe for votes/support in another kind of universe. If I had a Moral Parliament active back when I thought there was a good chance the universe could support unlimited computation, all the delegates that really care about astronomical waste would have traded away their votes in the kind of universe where we actually seem to live for votes in universes with a lot more potential astronomical waste. So today my Moral Parliament would be effectively controlled by delegates that care little about astronomical waste.
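The cross-universe vote trading described above can be made concrete with a toy model. In this minimal Python sketch, the names `effective_shares`, "astronomical_waste", "small", and "huge" are hypothetical, and the 50/50 credences are illustrative: each theory's delegate share equals its credence, and a trade agreed in advance transfers voting weight from one theory to another conditional on which kind of universe turns out to be actual.

```python
from fractions import Fraction


def effective_shares(credences, trades, actual_world):
    """Voting shares in the Parliament once the actual world is known.

    credences: dict mapping theory -> probability (its delegate share).
    trades: list of (giver, receiver, world, amount) tuples; each one says
        that `giver` hands `amount` of its voting share to `receiver`
        if `world` turns out to be the actual world.
    """
    shares = dict(credences)  # copy so the caller's credences are untouched
    for giver, receiver, world, amount in trades:
        if world == actual_world:
            shares[giver] -= amount
            shares[receiver] += amount
    return shares


half = Fraction(1, 2)
credences = {"astronomical_waste": half, "other": half}
trades = [
    # The delegate who cares about astronomical waste gives up its votes
    # in a small (finite-computation) universe...
    ("astronomical_waste", "other", "small", half),
    # ...in exchange for the other delegate's votes in a huge universe.
    ("other", "astronomical_waste", "huge", half),
]

print(effective_shares(credences, trades, "small"))
```

If the universe turns out to be the small one, the astronomical-waste delegate ends up with no votes at all, which is the conclusion above in miniature.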
Or to ask the question another way, is there such a thing as a theory of bounded rationality, and if so, is it the same thing as a theory of general intelligence?
The LW Wiki defines general intelligence as "ability to efficiently achieve goals in a wide range of domains", while instrumental rationality is defined as "the art of choosing and implementing actions that steer the future toward outcomes ranked higher in one's preferences". These definitions seem to suggest that rationality and intelligence are fundamentally the same concept.
However, rationality and AI have separate research communities. This seems to be mainly for historical reasons: people studying rationality started with theories of unbounded rationality (i.e., with logical omniscience or access to unlimited computing resources), whereas AI researchers started off trying to achieve modest goals in narrow domains with very limited computing resources. But rationality researchers are trying to find theories of bounded rationality, while people working on AI are trying to achieve more general goals with access to greater amounts of computing power, so the distinction may disappear if the two sides end up meeting in the middle.
We also distinguish between rationality and intelligence when talking about humans. I understand the former as the ability of someone to overcome various biases, which seems to consist of a set of skills that can be learned, while the latter is a kind of mental firepower measured by IQ tests. This seems to suggest another possibility. Maybe (as Robin Hanson recently argued on his blog) there is no such thing as a simple theory of how to optimally achieve arbitrary goals using limited computing power. In this view, general intelligence requires cooperation between many specialized modules containing domain-specific knowledge, so "rationality" would just be one module amongst many, which tries to find and correct systematic deviations from ideal (unbounded) rationality caused by the other modules.
I was more confused when I started writing this post, but now I seem to have largely answered my own question (modulo the uncertainty about the nature of intelligence mentioned above). However, I'm still interested to know how others would answer it. Do we have the same understanding of what "rationality" and "intelligence" mean, and do we know what distinction someone is trying to draw when they use one of these words instead of the other?
ETA: To clarify, I'm asking about the difference between general intelligence and rationality as theoretical concepts that apply to all agents. Human rationality vs intelligence may give us a clue to that answer, but isn't the main thing that I'm interested in here.
In this post, I list six metaethical possibilities that I think are plausible, along with some arguments or plausible stories about how/why they might be true, where that's not obvious. A lot of people seem fairly certain in their metaethical views, but I'm not and I want to convey my uncertainty as well as some of the reasons for it.
1. Most intelligent beings in the multiverse share similar preferences. This came about because there are facts about what preferences one should have, just like there exist facts about what decision theory one should use or what prior one should have, and species that manage to build intergalactic civilizations (or the equivalent in other universes) tend to discover all of these facts. There are occasional paperclip maximizers that arise, but they are a relatively minor presence or tend to be taken over by more sophisticated minds.
2. Facts about what everyone should value exist, and most intelligent beings have a part of their mind that can discover moral facts and find them motivating, but those parts don't have full control over their actions. These beings eventually build or become rational agents with values that represent compromises between different parts of their minds, so most intelligent beings end up having shared moral values along with idiosyncratic values.
3. There aren't facts about what everyone should value, but there are facts about how to translate non-preferences (e.g., emotions, drives, fuzzy moral intuitions, circular preferences, non-consequentialist values, etc.) into preferences. These facts may include, for example, what is the right way to deal with ontological crises. The existence of such facts seems plausible because if there were facts about what is rational (which seems likely) but no facts about how to become rational, that would seem like a strange state of affairs.
4. None of the above facts exist, so the only way to become or build a rational agent is to just think about what preferences you want your future self or your agent to hold, until you make up your mind in some way that depends on your psychology. But at least this process of reflection is convergent at the individual level, so each person can reasonably call the preferences that they endorse after reaching reflective equilibrium their morality or real values.
5. None of the above facts exist, and reflecting on what one wants turns out to be a divergent process (e.g., it's highly sensitive to initial conditions, like whether or not you drank a cup of coffee before you started, or to the order in which you happen to encounter philosophical arguments). There are still facts about rationality, so at least agents that are already rational can call their utility functions (or the equivalent of utility functions in whatever decision theory ends up being the right one) their real values.
6. There aren't any normative facts at all, including facts about what is rational. For example, it turns out there is no one decision theory that does better than every other decision theory in every situation, and there is no obvious or widely-agreed-upon way to determine which one "wins" overall.
(Note that for the purposes of this post, I'm concentrating on morality in the axiological sense (what one should value) rather than in the sense of cooperation and compromise. So alternative 1, for example, is not intended to include the possibility that most intelligent beings end up merging their preferences through some kind of grand acausal bargain.)
It may be useful to classify these possibilities using labels from academic philosophy. Here's my attempt: 1. realist + internalist 2. realist + externalist 3. relativist 4. subjectivist 5. moral anti-realist 6. normative anti-realist. (A lot of debates in metaethics concern the meaning of ordinary moral language, for example whether moral statements refer to facts or merely express attitudes. I mostly ignore such debates in the above list, because it's not clear what implications they have for the questions that I care about.)
One question LWers may have is, where does Eliezer's metaethics fall into this schema? Eliezer says that there are moral facts about what values every intelligence in the multiverse should have, but only humans are likely to discover these facts and be motivated by them. To me, Eliezer's use of language is counterintuitive, and since it seems plausible that there are facts about what everyone should value (or how each person should translate their non-preferences into preferences) that most intelligent beings can discover and be at least somewhat motivated by, I'm reserving the phrase "moral facts" for these. In my language, I think 3 or maybe 4 is probably closest to Eliezer's position.
In early 2000, I registered my personal domain name weidai.com, along with a couple of others, because I was worried that the small (sole-proprietor) ISP I was using would go out of business one day and break all the links on the web to the articles and software that I had published on my "home page" under its domain. Several years ago I started getting offers asking me to sell the domain, and now they're coming in almost every day. A couple of days ago I saw the first six-figure offer ($100,000).
In early 2009, someone named Satoshi Nakamoto emailed me personally with an announcement that he had published version 0.1 of Bitcoin. I didn't pay much attention at the time (I was more interested in Less Wrong than Cypherpunks at that point), but then in early 2011 I saw a LW article about Bitcoin, which prompted me to start mining it. I wrote at the time, "thanks to the discussion you started, I bought a Radeon 5870 and started mining myself, since it looks likely that I can at least break even on the cost of the card." That approximately $200 investment (plus maybe another $100 in electricity) is also worth around six figures today.
Clearly, technological advances can sometimes create gold rush-like situations (i.e., first-come-first-served opportunities to make truly extraordinary returns with minimal effort or qualifications). And it's possible to stumble into them without even trying. Which makes me think, maybe we should be trying? I mean, if only I had been looking for possible gold rushes, I could have registered a hundred domain names optimized for potential future value, rather than the few that I happened to personally need. Or I could have started mining Bitcoins a couple of years earlier and been a thousand times richer.
I wish I were already an experienced gold rush spotter, so I could explain how best to do it, but as indicated above, I participated in the ones that I did more or less by luck. Perhaps the first step is just to keep one's eyes open, and to keep in mind that tech-related gold rushes do happen from time to time and they are not impossibly difficult to find. What other ideas do people have? Are there other past examples of tech gold rushes besides the two that I mentioned? What might be some promising fields to look for them in the future?
On the subject of how an FAI team can avoid accidentally creating a UFAI, Carl Shulman wrote:
If we condition on having all other variables optimized, I'd expect a team to adopt very high standards of proof, and recognize limits to its own capabilities, biases, etc. One of the primary purposes of organizing a small FAI team is to create a team that can actually stop and abandon a line of research/design (Eliezer calls this "halt, melt, and catch fire") that cannot be shown to be safe (given limited human ability, incentives and bias).
In the history of philosophy, there have been many steps in the right direction, but virtually no significant problems have been fully solved, in the sense that philosophers agree that some proposed idea is the last word on a given subject. An FAI design involves making many explicit or implicit philosophical assumptions, many of which may then become fixed forever as governing principles for a new reality. They'll end up being the last word on their subjects, whether we like it or not. Given the history of philosophy and applying the outside view, how can an FAI team possibly reach "very high standards of proof" regarding the safety of a design? But if we can foresee that they can't, then what is the point of aiming for that predictable outcome now?
Until recently I hadn't paid a lot of attention to the discussions here about inside view vs outside view, because they tended to focus on the applicability of these views to the problem of predicting intelligence explosion. It seemed obvious to me that outside views can't possibly rule out intelligence explosion scenarios, and even a small probability of a future intelligence explosion would justify a much higher level of investment in preparing for that possibility than we currently make. But given that the inside vs outside view debate may also be relevant to the "FAI Endgame", I read up on Eliezer and Luke's most recent writings on the subject... and found them to be unobjectionable. Here's Eliezer:
On problems that are drawn from a barrel of causally similar problems, where human optimism runs rampant and unforeseen troubles are common, the Outside View beats the Inside View.
Does anyone want to argue that Eliezer's criteria for using the outside view are wrong, or don't apply here?
One obvious solution is to use multiple reference classes, and weight them by how relevant you think they are to the phenomenon you're trying to predict.
Once you've combined a handful of models to arrive at a qualitative or quantitative judgment, you should still be able to "adjust" the judgment in some cases using an inside view.
These ideas seem harder to apply, so I'll ask for readers' help. What reference classes should we use here, in addition to past attempts to solve philosophical problems? What inside view adjustments could a future FAI team make, such that they might justifiably overcome (the most obvious-to-me) outside view's conclusion that they're very unlikely to be in the possession of complete and fully correct solutions to a diverse range of philosophical problems?
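One simple way to read the two suggestions quoted above (several reference classes weighted by relevance, then an inside-view adjustment) is as a weighted average of base rates followed by an explicit correction. The Python sketch below assumes each reference class yields a success probability and a relevance weight, and applies the adjustment as a shift in log-odds; the second class name and all the numbers are hypothetical placeholders for illustration, not estimates from the post.

```python
import math


def combine_reference_classes(classes):
    """Relevance-weighted average of base rates (a linear opinion pool).

    classes: list of (name, base_rate, relevance_weight) tuples; the
    weights need not sum to 1 and are normalized here.
    """
    total_weight = sum(w for _, _, w in classes)
    if total_weight <= 0:
        raise ValueError("need at least one positive relevance weight")
    return sum(rate * w for _, rate, w in classes) / total_weight


def inside_view_adjust(prob, log_odds_shift):
    """Shift a probability (strictly between 0 and 1) by an inside-view
    adjustment expressed in log-odds; a shift of 0 leaves it unchanged."""
    log_odds = math.log(prob / (1 - prob)) + log_odds_shift
    return 1 / (1 + math.exp(-log_odds))


# Placeholder reference classes and numbers, for illustration only.
classes = [
    ("past attempts to solve philosophical problems", 0.05, 3.0),
    ("hypothetical second reference class", 0.20, 1.0),
]
outside = combine_reference_classes(classes)
adjusted = inside_view_adjust(outside, log_odds_shift=0.5)
print(f"outside view: {outside:.4f}, after adjustment: {adjusted:.4f}")
```

Averaging probabilities linearly is only one way to pool reference classes; the hard part the post points to, choosing the classes, weights, and size of the adjustment, is left entirely to the caller.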
I put "Friendliness" in quotes in the title, because I think what we really want, and what MIRI seems to be working towards, is closer to "optimality": create an AI that minimizes the expected amount of astronomical waste. In what follows I will continue to use "Friendly AI" to denote such an AI since that's the established convention.
I've often stated my objections to MIRI's plan to build an FAI directly (instead of after human intelligence has been substantially enhanced). But it's not because, as some have suggested while criticizing MIRI's FAI work, we can't foresee what problems need to be solved. I think it's because we can largely foresee what kinds of problems need to be solved to build an FAI, but they all look superhumanly difficult, either due to their inherent difficulty, or the lack of opportunity for "trial and error", or both.
When people say they don't know what problems need to be solved, they may be mostly talking about "AI safety" rather than "Friendly AI". If you think in terms of "AI safety" (i.e., making sure some particular AI doesn't cause a disaster) then that does look like a problem that depends on what kind of AI people will build. "Friendly AI" on the other hand is really a very different problem, where we're trying to figure out what kind of AI to build in order to minimize astronomical waste. I suspect this may explain the apparent disagreement, but I'm not sure. I'm hoping that explaining my own position more clearly will help figure out whether there is a real disagreement, and what's causing it.
The basic issue I see is that there is a large number of serious philosophical problems facing an AI that is meant to take over the universe in order to minimize astronomical waste. The AI needs a full solution to moral philosophy to know which configurations of particles/fields (or perhaps which dynamical processes) are most valuable and which are not. Moral philosophy in turn seems to have dependencies on the philosophy of mind, consciousness, metaphysics, aesthetics, and other areas. The FAI also needs solutions to many problems in decision theory, epistemology, and the philosophy of mathematics, in order to not be stuck with making wrong or suboptimal decisions for eternity. These essentially cover all the major areas of philosophy.
For an FAI builder, there are three ways to deal with the presence of these open philosophical problems, as far as I can see. (There may be other ways for the future to turn out well without the AI builders making any special effort, for example if being philosophical is just a natural attractor for any superintelligence, but I don't see any way to be confident of this ahead of time.) I'll name them for convenient reference, but keep in mind that an actual design may use a mixture of approaches.
- Normative AI - Solve all of the philosophical problems ahead of time, and code the solutions into the AI.
- Black-Box Metaphilosophical AI - Program the AI to use the minds of one or more human philosophers as a black box to help it solve philosophical problems, without the AI builders understanding what "doing philosophy" actually is.
- White-Box Metaphilosophical AI - Understand the nature of philosophy well enough to specify "doing philosophy" as an algorithm and code it into the AI.
The problem with Normative AI, besides the obvious inherent difficulty (as evidenced by the slow progress of human philosophers after decades, sometimes centuries, of work), is that it requires us to anticipate all of the philosophical problems the AI might encounter in the future, from now until the end of the universe. We can certainly foresee some of these, like the problems associated with agents being copyable, or the AI radically changing its ontology of the world, but what might we be missing?
Black-Box Metaphilosophical AI is also risky, because it's hard to test/debug something that you don't understand. Besides that general concern, designs in this category (such as Paul Christiano's take on indirect normativity) seem to require that the AI achieve superhuman levels of optimizing power before being able to solve its philosophical problems, which seems to mean that a) there's no way to test them in a safe manner, and b) it's unclear why such an AI won't cause disaster in the time period before it achieves philosophical competence.
White-Box Metaphilosophical AI may be the most promising approach. There is no strong empirical evidence that solving metaphilosophy is superhumanly difficult, simply because not many people have attempted to solve it. But I don't think that a reasonable prior combined with what evidence we do have (i.e., absence of visible progress or clear hints as to how to proceed) gives much reason for optimism either.
To recap, I think we can largely already see what kinds of problems must be solved in order to build a superintelligent AI that will minimize astronomical waste while colonizing the universe, and it looks like they probably can't be solved correctly with high confidence until humans become significantly smarter than we are now. I think I understand why some people disagree with me (e.g., Eliezer thinks these problems just aren't that hard, relative to his abilities), but I'm not sure why some others say that we don't yet know what the problems will be.
I find Eliezer's explanation of what "should" means to be unsatisfactory, and here's an attempt to do better. Consider the following usages of the word:
1. You should stop building piles of X pebbles because X = Y*Z.
2. We should kill that police informer and dump his body in the river.
3. You should one-box in Newcomb's problem.
All of these seem to be sensible sentences, depending on the speaker and intended audience. #1, for example, seems a reasonable translation of what a pebblesorter would say after discovering that X = Y*Z. Some might argue for "pebblesorter::should" instead of plain "should", but it's hard to deny that we need "should" in some form to fill the blank there for a translation, and I think few people besides Eliezer would object to plain "should".
Normativity, or the idea that there's something in common about how "should" and similar words are used in different contexts, is an active area in academic philosophy. I won't try to survey the current theories, but my current thinking is that "should" usually means "better according to some shared, motivating standard or procedure of evaluation", but occasionally it can also be used to instill such a standard or procedure of evaluation in someone (such as a child) who is open to being instilled by the speaker/writer.
It seems to me that different people (including different humans) can have different motivating standards and procedures of evaluation, and apparent disagreements about "should" sentences can arise from having different standards/procedures or from disagreement about whether something is better according to a shared standard/procedure. In most areas my personal procedure of evaluation is something that might be called "doing philosophy", but many people apparently do not share this. For example, a religious extremist may have been taught by their parents, teachers, or peers to follow some rigid moral code given in their holy books, and not be open to any philosophical arguments that I can offer.
Of course this isn't a fully satisfactory theory of normativity since I don't know what "philosophy" really is (and I'm not even sure it really is a thing). But it does help explain how "should" in morality might relate to "should" in other areas such as decision theory, does not require assuming that all humans ultimately share the same morality, and avoids the need for linguistic contortions such as "pebblesorter::should".
I don't know what my values are. I don't even know how to find out what my values are. But do I know something about how I (or an FAI) may be able to find out what my values are? Perhaps... and I've organized my answer to this question in the form of an "Outline of Possible Sources of Values". I hope it also serves as a summary of the major open problems in this area.
- other humans
- other agents
- actual (historical/observed) behavior
- counterfactual (simulated/predicted) behavior
- Subconscious Cognition
  - model-based decision making
    - heuristics for extrapolating/updating model
    - (partial) utility function
  - model-free decision making
    - identity based (adopt a social role like "environmentalist" or "academic" and emulate an appropriate role model, actual or idealized)
    - reinforcement based
- Conscious Cognition
  - decision making using explicit verbal and/or quantitative reasoning
    - consequentialist (similar to model-based above, but using explicit reasoning)
    - virtue ethical
    - identity based
  - reasoning about terminal goals/values/preferences/moral principles
    - responses (changes in state) to moral arguments (possibly context dependent)
    - distributions of autonomously generated moral arguments (possibly context dependent)
    - logical structure (if any) of moral reasoning
  - object-level intuitions/judgments
    - about what one should do in particular ethical situations
    - about the desirabilities of particular outcomes
    - about moral principles
  - meta-level intuitions/judgments
    - about the nature of morality
    - about the complexity of values
    - about what the valid sources of values are
    - about what constitutes correct moral reasoning
    - about how to explicitly/formally/effectively represent values (utility function, multiple utility functions, deontological rules, or something else) (if utility function(s), for what decision theory and ontology?)
    - about how to extract/translate/combine sources of values into a representation of values
      - how to solve ontological crisis
      - how to deal with native utility function or revealed preferences being partial
      - how to translate non-consequentialist sources of values into utility function(s)
      - how to deal with moral principles being vague and incomplete
      - how to deal with conflicts between different sources of values
      - how to deal with lack of certainty in one's intuitions/judgments
    - whose intuition/judgment ought to be applied? (may be different for each of the above)
      - the subject's (at what point in time? current intuitions, eventual judgments, or something in between?)
      - the FAI designers'
      - the FAI's own philosophical conclusions
Using this outline, we can obtain a concise understanding of what many metaethical theories and FAI proposals are claiming/suggesting and how they differ from each other. For example, Nyan_Sandwich's "morality is awesome" thesis can be interpreted as the claim that the most important source of values is our intuitions about the desirability (awesomeness) of particular outcomes.
As another example, Aaron Swartz argued against "reflective equilibrium", by which he meant the claim that the valid sources of values are our object-level moral intuitions, and that correct moral reasoning consists of working back and forth between these intuitions until they reach coherence. His own position was that intuitions about moral principles are the only valid source of values and we should discount our intuitions about particular ethical situations.
A final example is Paul Christiano's "Indirect Normativity" proposal for FAI (n.b., the term "Indirect Normativity" was originally coined by Nick Bostrom to refer to an entire class of designs where the AI's values are defined "indirectly"), where an important source of values is the distribution of moral arguments the subject is likely to generate in a particular simulated environment and their responses to those arguments. Also, just about every meta-level question is left for the (simulated) subject to answer, except for the decision theory and ontology of the utility function that their values must finally be encoded in, which are fixed by the FAI designer.
I think the outline includes most of the ideas brought up in past LW discussions, or in moral philosophies that I'm familiar with. Please let me know if I left out anything important.