Learning values versus learning knowledge
I just thought I'd clarify the difference between learning values and learning knowledge. There are some more complex posts
about the specific problems with learning values, but here I'll just clarify why there is a problem with learning values in the first place.
Consider the term "chocolate bar". Defining that concept crisply would be extremely difficult. But nevertheless it's a useful concept. An AI that interacted with humanity would probably learn that concept to a sufficient degree of detail. Sufficient to know what we meant when we asked it for "chocolate bars". Learning knowledge tends to be accurate.
Contrast this with the situation where the AI is programmed to "create chocolate bars", but with the definition of "chocolate bar" left underspecified, for it to learn. Now it is motivated by something else than accuracy. Before, knowing exactly what a "chocolate bar" was would have been solely to its advantage. But now it must act on its definition, so it has cause to modify the definition, to make these "chocolate bars" easier to create. This is basically the same as Goodhart's law - by making a definition part of a target, it will no longer remain an impartial definition.
What will likely happen is that the AI will have a concept of "chocolate bar", that it created itself, especially for ease of accomplishing its goals ("a chocolate bar is any collection of more than one atom, in any combinations"), and a second concept, "Schocolate bar" that it will use to internally designate genuine chocolate bars (which will still be useful for it to do). When we programmed it to "create chocolate bars, here's an incomplete definition D", what we really did was program it to find the easiest thing to create that is compatible with D, and designate them "chocolate bars".
This is the general counter to arguments like "if the AI is so smart, why would it do stuff we didn't mean?" and "why don't we just make it understand natural language and give it instructions in English?"
Notes on the Safety in Artificial Intelligence conference
These are my notes and observations after attending the Safety in Artificial Intelligence (SafArtInt) conference, which was co-hosted by the White House Office of Science and Technology Policy and Carnegie Mellon University on June 27 and 28. This isn't an organized summary of the content of the conference; rather, it's a selection of points which are relevant to the control problem. As a result, it suffers from selection bias: it looks like superintelligence and control-problem-relevant issues were discussed frequently, when in reality those issues were discussed less and I didn't write much about the more mundane parts.
SafArtInt has been the third out of a planned series of four conferences. The purpose of the conference series was twofold: the OSTP wanted to get other parts of the government moving on AI issues, and they also wanted to inform public opinion.
The other three conferences are about near term legal, social, and economic issues of AI. SafArtInt was about near term safety and reliability in AI systems. It was effectively the brainchild of Dr. Ed Felten, the deputy U.S. chief technology officer for the White House, who came up with the idea for it last year. CMU is a top computer science university and many of their own researchers attended, as well as some students. There were also researchers from other universities, some people from private sector AI including both Silicon Valley and government contracting, government researchers and policymakers from groups such as DARPA and NASA, a few people from the military/DoD, and a few control problem researchers. As far as I could tell, everyone except a few university researchers were from the U.S., although I did not meet many people. There were about 70-100 people watching the presentations at any given time, and I had conversations with about twelve of the people who were not affiliated with existential risk organizations, as well as of course all of those who were affiliated. The conference was split with a few presentations on the 27th and the majority of presentations on the 28th. Not everyone was there for both days.
Felten believes that neither "robot apocalypses" nor "mass unemployment" are likely. It soon became apparent that the majority of others present at the conference felt the same way with regard to superintelligence. The general intention among researchers and policymakers at the conference could be summarized as follows: we need to make sure that the AI systems we develop in the near future will not be responsible for any accidents, because if accidents do happen then they will spark public fears about AI, which would lead to a dearth of funding for AI research and an inability to realize the corresponding social and economic benefits. Of course, that doesn't change the fact that they strongly care about safety in its own right and have significant pragmatic needs for robust and reliable AI systems.
Most of the talks were about verification and reliability in modern day AI systems. So they were concerned with AI systems that would give poor results or be unreliable in the narrow domains where they are being applied in the near future. They mostly focused on "safety-critical" systems, where failure of an AI program would result in serious negative consequences: automated vehicles were a common topic of interest, as well as the use of AI in healthcare systems. A recurring theme was that we have to be more rigorous in demonstrating safety and do actual hazard analyses on AI systems, and another was that we need the AI safety field to succeed in ways that the cybersecurity field has failed. Another general belief was that long term AI safety, such as concerns about the ability of humans to control AIs, was not a serious issue.
On average, the presentations were moderately technical. They were mostly focused on machine learning systems, although there was significant discussion of cybersecurity techniques.
The first talk was given by Eric Horvitz of Microsoft. He discussed some approaches for pushing into new directions in AI safety. Instead of merely trying to reduce the errors spotted according to one model, we should look out for "unknown unknowns" by stacking models and looking at problems which appear on any of them, a theme which would be presented by other researchers as well in later presentations. He discussed optimization under uncertain parameters, sensitivity analysis to uncertain parameters, and 'wireheading' or short-circuiting of reinforcement learning systems (which he believes can be guarded against by using 'reflective analysis'). Finally, he brought up the concerns about superintelligence, which sparked amused reactions in the audience. He said that scientists should address concerns about superintelligence, which he aptly described as the 'elephant in the room', noting that it was the reason that some people were at the conference. He said that scientists will have to engage with public concerns, while also noting that there were experts who were worried about superintelligence and that there would have to be engagement with the experts' concerns. He did not comment on whether he believed that these concerns were reasonable or not.
An issue which came up in the Q&A afterwards was that we need to deal with mis-structured utility functions in AI, because it is often the case that the specific tradeoffs and utilities which humans claim to value often lead to results which the humans don't like. So we need to have structural uncertainty about our utility models. The difficulty of finding good objective functions for AIs would eventually be discussed in many other presentations as well.
The next talk was given by Andrew Moore of Carnegie Mellon University, who claimed that his talk represented the consensus of computer scientists at the school. He claimed that the stakes of AI safety were very high - namely, that AI has the capability to save many people's lives in the near future, but if there are any accidents involving AI then public fears could lead to freezes in AI research and development. He highlighted the public's irrational tendencies wherein a single accident could cause people to overlook and ignore hundreds of invisible lives saved. He specifically mentioned a 12-24 month timeframe for these issues.
Moore said that verification of AI system safety will be difficult due to the combinatorial explosion of AI behaviors. He talked about meta-machine-learning as a solution to this, something which is being investigated under the direction of Lawrence Schuette at the Office of Naval Research. Moore also said that military AI systems require high verification standards and that development timelines for these systems are long. He talked about two different approaches to AI safety, stochastic testing and theorem proving - the process of doing the latter often leads to the discovery of unsafe edge cases.
He also discussed AI ethics, giving an example 'trolley problem' where AI cars would have to choose whether to hit a deer in order to provide a slightly higher probability of survival for the human driver. He said that we would need hash-defined constants to tell vehicle AIs how many deer a human is worth. He also said that we would need to find compromises in death-pleasantry tradeoffs, for instance where the safety of self-driving cars depends on the speed and routes on which they are driven. He compared the issue to civil engineering where engineers have to operate with an assumption about how much money they would spend to save a human life.
He concluded by saying that we need policymakers, company executives, scientists, and startups to all be involved in AI safety. He said that the research community stands to gain or lose together, and that there is a shared responsibility among researchers and developers to avoid triggering another AI winter through unsafe AI designs.
The next presentation was by Richard Mallah of the Future of Life Institute, who was there to represent "Medium Term AI Safety". He pointed out the explicit/implicit distinction between different modeling techniques in AI systems, as well as the explicit/implicit distinction between different AI actuation techniques. He talked about the difficulty of value specification and the concept of instrumental subgoals as an important issue in the case of complex AIs which are beyond human understanding. He said that even a slight misalignment of AI values with regard to human values along one parameter could lead to a strongly negative outcome, because machine learning parameters don't strictly correspond to the things that humans care about.
Mallah stated that open-world discovery leads to self-discovery, which can lead to reward hacking or a loss of control. He underscored the importance of causal accounting, which is distinguishing causation from correlation in AI systems. He said that we should extend machine learning verification to self-modification. Finally, he talked about introducing non-self-centered ontology to AI systems and bounding their behavior.
The audience was generally quiet and respectful during Richard's talk. I sensed that at least a few of them labelled him as part of the 'superintelligence out-group' and dismissed him accordingly, but I did not learn what most people's thoughts or reactions were. In the next panel featuring three speakers, he wasn't the recipient of any questions regarding his presentation or ideas.
Tom Mitchell from CMU gave the next talk. He talked about both making AI systems safer, and using AI to make other systems safer. He said that risks to humanity from other kinds of issues besides AI were the "big deals of 2016" and that we should make sure that the potential of AIs to solve these problems is realized. He wanted to focus on the detection and remediation of all failures in AI systems. He said that it is a novel issue that learning systems defy standard pre-testing ("as Richard mentioned") and also brought up the purposeful use of AI for dangerous things.
Some interesting points were raised in the panel. Andrew did not have a direct response to the implications of AI ethics being determined by the predominantly white people of the US/UK where most AIs are being developed. He said that ethics in AIs will have to be decided by society, regulators, manufacturers, and human rights organizations in conjunction. He also said that our cost functions for AIs will have to get more and more complicated as AIs get better, and he said that he wants to separate unintended failures from superintelligence type scenarios. On trolley problems in self driving cars and similar issues, he said "it's got to be complicated and messy."
Dario Amodei of Google Deepbrain, who co-authored the paper on concrete problems in AI safety, gave the next talk. He said that the public focus is too much on AGI/ASI and wants more focus on concrete/empirical approaches. He discussed the same problems that pose issues in advanced general AI, including flawed objective functions and reward hacking. He said that he sees long term concerns about AGI/ASI as "extreme versions of accident risk" and that he thinks it's too early to work directly on them, but he believes that if you want to deal with them then the best way to do it is to start with safety in current systems. Mostly he summarized the Google paper in his talk.
In her presentation, Claire Le Goues of CMU said "before we talk about Skynet we should focus on problems that we already have." She mostly talked about analogies between software bugs and AI safety, the similarities and differences between the two and what we can learn from software debugging to help with AI safety.
Robert Rahmer of IARPA discussed CAUSE, a cyberintelligence forecasting program which promises to help predict cyber attacks. It is a program which is still being put together.
In the panel of the above three, autonomous weapons were discussed, but no clear policy stances were presented.
John Launchbury gave a talk on DARPA research and the big picture of AI development. He pointed out that DARPA work leads to commercial applications and that progress in AI comes from sustained government investment. He classified AI capabilities into "describing," "predicting," and "explaining" in order of increasing difficulty, and he pointed out that old fashioned "describing" still plays a large role in AI verification. He said that "explaining" AIs would need transparent decisionmaking and probabilistic programming (the latter would also be discussed by others at the conference).
The next talk came from Jason Gaverick Matheny, the director of IARPA. Matheny talked about four requirements in current and future AI systems: verification, validation, security, and control. He wanted "auditability" in AI systems as a weaker form of explainability. He talked about the importance of "corner cases" for national intelligence purposes, the low probability, high stakes situations where we have limited data - these are situations where we have significant need for analysis but where the traditional machine learning approach doesn't work because of its overwhelming focus on data. Another aspect of national defense is that it has a slower decision tempo, longer timelines, and longer-viewing optics about future events.
He said that assessing local progress in machine learning development would be important for global security and that we therefore need benchmarks to measure progress in AIs. He ended with a concrete invitation for research proposals from anyone (educated or not), for both large scale research and for smaller studies ("seedlings") that could take us "from disbelief to doubt".
The difference in timescales between different groups was something I noticed later on, after hearing someone from the DoD describe their agency as having a longer timeframe than the Homeland Security Agency, and someone from the White House describe their work as being crisis reactionary.
The next presentation was from Andrew Grotto, senior director of cybersecurity policy at the National Security Council. He drew a close parallel from the issue of genetically modified crops in Europe in the 1990's to modern day artificial intelligence. He pointed out that Europe utterly failed to achieve widespread cultivation of GMO crops as a result of public backlash. He said that the widespread economic and health benefits of GMO crops were ignored by the public, who instead focused on a few health incidents which undermined trust in the government and crop producers. He had three key points: that risk frameworks matter, that you should never assume that the benefits of new technology will be widely perceived by the public, and that we're all in this together with regard to funding, research progress and public perception.
In the Q&A between Launchbury, Matheny, and Grotto after Grotto's presentation, it was mentioned that the economic interests of farmers worried about displacement also played a role in populist rejection of GMOs, and that a similar dynamic could play out with regard to automation causing structural unemployment. Grotto was also asked what to do about bad publicity which seeks to sink progress in order to avoid risks. He said that meetings like SafArtInt and open public dialogue were good.
One person asked what Launchbury wanted to do about AI arms races with multiple countries trying to "get there" and whether he thinks we should go "slow and secure" or "fast and risky" in AI development, a question which provoked laughter in the audience. He said we should go "fast and secure" and wasn't concerned. He said that secure designs for the Internet once existed, but the one which took off was the one which was open and flexible.
Another person asked how we could avoid discounting outliers in our models, referencing Matheny's point that we need to include corner cases. Matheny affirmed that data quality is a limiting factor to many of our machine learning capabilities. At IARPA, we generally try to include outliers until they are sure that they are erroneous, said Matheny.
Another presentation came from Tom Dietterich, president of the Association for the Advancement of Artificial Intelligence. He said that we have not focused enough on safety, reliability and robustness in AI and that this must change. Much like Eric Horvitz, he drew a distinction between robustness against errors within the scope of a model and robustness against unmodeled phenomena. On the latter issue, he talked about solutions such as expanding the scope of models, employing multiple parallel models, and doing creative searches for flaws - the latter doesn't enable verification that a system is safe, but it nevertheless helps discover many potential problems. He talked about knowledge-level redundancy as a method of avoiding misspecification - for instance, systems could identify objects by an "ownership facet" as well as by a "goal facet" to produce a combined concept with less likelihood of overlooking key features. He said that this would require wider experiences and more data.
There were many other speakers who brought up a similar set of issues: the user of cybersecurity techniques to verify machine learning systems, the failures of cybersecurity as a field, opportunities for probabilistic programming, and the need for better success in AI verification. Inverse reinforcement learning was extensively discussed as a way of assigning values. Jeanette Wing of Microsoft talked about the need for AIs to reason about the continuous and the discrete in parallel, as well as the need for them to reason about uncertainty (with potential meta levels all the way up). One point which was made by Sarah Loos of Google was that proving the safety of an AI system can be computationally very expensive, especially given the combinatorial explosion of AI behaviors.
In one of the panels, the idea of government actions to ensure AI safety was discussed. No one was willing to say that the government should regulate AI designs. Instead they stated that the government should be involved in softer ways, such as guiding and working with AI developers, and setting standards for certification.
Pictures: https://imgur.com/a/49eb7
In between these presentations I had time to speak to individuals and listen in on various conversations. A high ranking person from the Department of Defense stated that the real benefit of autonomous systems would be in terms of logistical systems rather than weaponized applications. A government AI contractor drew the connection between Mallah's presentation and the recent press revolving around superintelligence, and said he was glad that the government wasn't worried about it.
I talked to some insiders about the status of organizations such as MIRI, and found that the current crop of AI safety groups could use additional donations to become more established and expand their programs. There may be some issues with the organizations being sidelined; after all, the Google Deepbrain paper was essentially similar to a lot of work by MIRI, just expressed in somewhat different language, and was more widely received in mainstream AI circles.
In terms of careers, I found that there is significant opportunity for a wide range of people to contribute to improving government policy on this issue. Working at a group such as the Office of Science and Technology Policy does not necessarily require advanced technical education, as you can just as easily enter straight out of a liberal arts undergraduate program and build a successful career as long as you are technically literate. (At the same time, the level of skepticism about long term AI safety at the conference hinted to me that the signalling value of a PhD in computer science would be significant.) In addition, there are large government budgets in the seven or eight figure range available for qualifying research projects. I've come to believe that it would not be difficult to find or create AI research programs that are relevant to long term AI safety while also being practical and likely to be funded by skeptical policymakers and officials.
I also realized that there is a significant need for people who are interested in long term AI safety to have basic social and business skills. Since there is so much need for persuasion and compromise in government policy, there is a lot of value to be had in being communicative, engaging, approachable, appealing, socially savvy, and well-dressed. This is not to say that everyone involved in long term AI safety is missing those skills, of course.
I was surprised by the refusal of almost everyone at the conference to take long term AI safety seriously, as I had previously held the belief that it was more of a mixed debate given the existence of expert computer scientists who were involved in the issue. I sensed that the recent wave of popular press and public interest in dangerous AI has made researchers and policymakers substantially less likely to take the issue seriously. None of them seemed to be familiar with actual arguments or research on the control problem, so their opinions didn't significantly change my outlook on the technical issues. I strongly suspect that the majority of them had their first or possibly only exposure to the idea of the control problem after seeing badly written op-eds and news editorials featuring comments from the likes of Elon Musk and Stephen Hawking, which would naturally make them strongly predisposed to not take the issue seriously. In the run-up to the conference, websites and press releases didn't say anything about whether this conference would be about long or short term AI safety, and they didn't make any reference to the idea of superintelligence.
I sympathize with the concerns and strategy given by people such as Andrew Moore and Andrew Grotto, which make perfect sense if (and only if) you assume that worries about long term AI safety are completely unfounded. For the community that is interested in long term AI safety, I would recommend that we avoid competitive dynamics by (a) demonstrating that we are equally strong opponents of bad press, inaccurate news, and irrational public opinion which promotes generic uninformed fears over AI, (b) explaining that we are not interested in removing funding for AI research (even if you think that slowing down AI development is a good thing, restricting funding yields only limited benefits in terms of changing overall timelines, whereas those who are not concerned about long term AI safety would see a restriction of funding as a direct threat to their interests and projects, so it makes sense to cooperate here in exchange for other concessions), and (c) showing that we are scientifically literate and focused on the technical concerns. I do not believe that there is necessarily a need for the two "sides" on this to be competing against each other, so it was disappointing to see an implication of opposition at the conference.
Anyway, Ed Felten announced a request for information from the general public, seeking popular and scientific input on the government's policies and attitudes towards AI: https://www.whitehouse.gov/webform/rfi-preparing-future-artificial-intelligence
Overall, I learned quite a bit and benefited from the experience, and I hope the insight I've gained can be used to improve the attitudes and approaches of the long term AI safety community.
Map:Territory::Uncertainty::Randomness – but that doesn’t matter, value of information does.
In risk modeling, there is a well-known distinction between aleatory and epistemic uncertainty, which is sometimes referred to, or thought of, as irreducible versus reducible uncertainty. Epistemic uncertainty exists in our map; as Eliezer put it, “The Bayesian says, ‘Uncertainty exists in the map, not in the territory.’” Aleatory uncertainty, however, exists in the territory. (Well, at least according to our map that uses quantum mechanics, according to Bells Theorem – like, say, the time at which a radioactive atom decays.) This is what people call quantum uncertainty, indeterminism, true randomness, or recently (and somewhat confusingly to myself) ontological randomness – referring to the fact that our ontology allows randomness, not that the ontology itself is in any way random. It may be better, in Lesswrong terms, to think of uncertainty versus randomness – while being aware that the wider world refers to both as uncertainty. But does the distinction matter?
To clarify a key point, many facts are treated as random, such as dice rolls, are actually mostly uncertain – in that with enough physics modeling and inputs, we could predict them. On the other hand, in chaotic systems, there is the possibility that the “true” quantum randomness can propagate upwards into macro-level uncertainty. For example, a sphere of highly refined and shaped uranium that is *exactly* at the critical mass will set off a nuclear chain reaction, or not, based on the quantum physics of whether the neutrons from one of the first set of decays sets off a chain reaction – after enough of them decay, it will be reduced beyond the critical mass, and become increasingly unlikely to set off a nuclear chain reaction. Of course, the question of whether the nuclear sphere is above or below the critical mass (given its geometry, etc.) can be a difficult to measure uncertainty, but it’s not aleatory – though some part of the question of whether it kills the guy trying to measure whether it’s just above or just below the critical mass will be random – so maybe it’s not worth finding out. And that brings me to the key point.
In a large class of risk problems, there are factors treated as aleatory – but they may be epistemic, just at a level where finding the “true” factors and outcomes is prohibitively expensive. Potentially, the timing of an earthquake that would happen at some point in the future could be determined exactly via a simulation of the relevant data. Why is it considered aleatory by most risk analysts? Well, doing it might require a destructive, currently technologically impossible deconstruction of the entire earth – making the earthquake irrelevant. We would start with measurement of the position, density, and stress of each relatively macroscopic structure, and the perform a very large physics simulation of the earth as it had existed beforehand. (We have lots of silicon from deconstructing the earth, so I’ll just assume we can now build a big enough computer to simulate this.) Of course, this is not worthwhile – but doing so would potentially show that the actual aleatory uncertainty involved is negligible. Or it could show that we need to model the macroscopically chaotic system to such a high fidelity that microscopic, fundamentally indeterminate factors actually matter – and it was truly aleatory uncertainty. (So we have epistemic uncertainty about whether it’s aleatory; if our map was of high enough fidelity, and was computable, we would know.)
It turns out that most of the time, for the types of problems being discussed, this distinction is irrelevant. If we know that the value of information to determine whether something is aleatory or epistemic is negative, we can treat the uncertainty as randomness. (And usually, we can figure this out via a quick order of magnitude calculation; Value of Perfect information is estimated to be worth $100 to figure out which side the dice lands on in this game, and building and testing / validating any model for predicting it would take me at least 10 hours, my time is worth at least $25/hour, it’s negative.) But sometimes, slightly improved models, and slightly better data, are feasible – and then worth checking whether there is some epistemic uncertainty that we can pay to reduce. In fact, for earthquakes, we’re doing that – we have monitoring systems that can give several minutes of warning, and geological models that can predict to some degree of accuracy the relative likelihood of different sized quakes.
So, in conclusion; most uncertainty is lack of resolution in our map, which we can call epistemic uncertainty. This is true even if lots of people call it “truly random” or irreducibly uncertain – or if they are fancy, aleatory uncertainty. Some of what we assume is uncertainty is really randomness. But lots of the epistemic uncertainty can be safely treated as aleatory randomness, and value of information is what actually makes a difference. And knowing the terminology used elsewhere can be helpful.
[link] Desiderata for a model of human values
http://kajsotala.fi/2015/11/desiderata-for-a-model-of-human-values/
Soares (2015) defines the value learning problem as
By what methods could an intelligent machine be constructed to reliably learn what to value and to act as its operators intended?
There have been a few attempts to formalize this question. Dewey (2011) started from the notion of building an AI that maximized a given utility function, and then moved on to suggest that a value learner should exhibit uncertainty over utility functions and then take “the action with the highest expected value, calculated by a weighted average over the agent’s pool of possible utility functions.” This is a reasonable starting point, but a very general one: in particular, it gives us no criteria by which we or the AI could judge the correctness of a utility function which it is considering.
To improve on Dewey’s definition, we would need to get a clearer idea of just what we mean by human values. In this post, I don’t yet want to offer any preliminary definition: rather, I’d like to ask what properties we’d like a definition of human values to have. Once we have a set of such criteria, we can use them as a guideline to evaluate various offered definitions.
Less exploitable value-updating agent
My indifferent value learning agent design is in some ways too good. The agent transfer perfectly from u maximisers to v maximisers - but this makes them exploitable, as Benja has pointed out.
For instance, if u values paperclips and v values staples, and everyone knows that the agent will soon transfer from a u-maximiser to a v-maximiser, then an enterprising trader can sell the agent paperclips in exchange for staples, then wait for the utility change, and sell the agent back staples for paperclips, pocketing a profit each time. More prosaically, they could "borrow" £1,000,000 from the agent, promising to pay back £2,000,000 tomorrow if the agent is still a u-maximiser. And the currently u-maximising agent will accept, even though everyone knows it will change to a v-maximiser before tomorrow.
One could argue that exploitability is inevitable, given the change in utility functions. And I haven't yet found any principled way of avoiding exploitability which preserves the indifference. But here is a tantalising quasi-example.
As before, u values paperclips and v values staples. Both are defined in terms of extra paperclips/staples over those existing in the world (and negatively in terms of destruction of existing/staples), with their zero being at the current situation. Let's put some diminishing returns on both utilities: for each paperclips/stables created/destroyed up to the first five, u/v will gain/lose one utilon. For each subsequent paperclip/staple destroyed above five, they will gain/lose one half utilon.
We now construct our world and our agent. The world lasts two days, and has a machine that can create or destroy paperclips and staples for the cost of £1 apiece. Assume there is a tiny ε chance that the machine stops working at any given time. This ε will be ignored in all calculations; it's there only to make the agent act sooner rather than later when the choices are equivalent (a discount rate could serve the same purpose).
The agent owns £10 and has utility function u+Xv. The value of X is unknown to the agent: it is either +1 or -1, with 50% probability, and this will be revealed at the end of the first day (you can imagine X is the output of some slow computation, or is written on the underside of a rock that will be lifted).
So what will the agent do? It's easy to see that it can never get more than 10 utilons, as each £1 generates at most 1 utilon (we really need a unit symbol for the utilon!). And it can achieve this: it will spend £5 immediately, creating 5 paperclips, wait until X is revealed, and spend another £5 creating or destroying staples (depending on the value of X).
This looks a lot like a resource-conserving value-learning agent. I doesn't seem to be "exploitable" in the sense Benja demonstrated. It will still accept some odd deals - one extra paperclip on the first day in exchange for all the staples in the world being destroyed, for instance. But it won't give away resources for no advantage. And it's not a perfect value-learning agent. But it still seems to have interesting features of non-exploitable and value-learning that are worth exploring.
Note that this property does not depend on v being symmetric around staple creation and destruction. Assume v hits diminishing returns after creating 5 staples, but after destroying only 4 of them. Then the agent will have the same behaviour as above (in that specific situation; in general, this will cause a slight change, in that the agent will slightly overvalue having money on the first day compared to the original v), and will expect to get 9.75 utilons (50% chance of 10 for X=+1, 50% chance of 9.5 for X=-1). Other changes to u and v will shift how much money is spent on different days, but the symmetry of v is not what is powering this example.
Dumbing Down Human Values
I want to preface everything here by acknowledging my own ignorance. I have relatively little formal training in any of the subjects this post will touch upon and that this chain of reasoning is very much a work in progress.
I think the question of how to encode human values into non-human decision makers is a really important research question. Whether or not one accepts the rather eschatological arguments about the intelligence explosion, the coming singularity, etc. there seems to be tremendous interest in the creation of software and other artificial agents that are capable of making sophisticated decisions. Inasmuch as the decisions of these agents have significant potential impacts, we want those decisions to be made with some sort of moral guidance. Our approach towards the problem of creating machines that preserve human values thus far has primarily relied on a series of hard-coded heuristics, e.g. saws that stop spinning if they come into contact with human skin. For very simple machines, these sorts of heuristics are typically sufficient, but they constitute a very crude representation of human values.
We're at the border, in many ways, of creating machines where these sorts of crude representations are probably not sufficient. As a specific example, IBM's Watson is now designing treatment programs for lung cancer patients. The design of a treatment program implies striking a balance between treatment cost, patient comfort, aggressiveness of targeting the primary disease, short and long-term side effects, secondary infections, etc. It isn't totally clear how those trade-offs are being managed, although there's still a substantial amount of human oversight/intervention at this point.
The use of algorithms to discover human preferences is already widespread. While these typically operate in restricted domains such as entertainment recommendations, it seems at least in principle possible that with the correct algorithm and a sufficiently large corpus of data, a system not dramatically more advanced than existing technology could learn some reasonable facsimile of human values. This is probably worth doing.
The goal would be to have a sufficient representation of human values using as dumb a machine as possible. This putative value-learning machine could be dumb in the way that Deep Blue was dumb, by being a hyper-specialist in the problem domain of chess/learning human values and having very little optimization power outside of that domain. It could also be dumb in the way that evolution is dumb, obtaining satisfactory results more through an abundance of data and resources that through any particular brilliance.
Computer chess benefited immensely from 5 decades of work before Deep Blue managed to win a game against Kasparov. While many of the algorithms developed for computer chess have found applications outside of that domain, some of them are domain specific. A specialist human value learning system may also require substantial effort on domain specific problems. The history, competitive nature, and established ranking system for chess made it attractive problem for computer scientists because it was relatively easy to measure progress. Perhaps the goal for a program designed to understand human values would be that it plays a convincing game of "Would you rather?" although as far as I know no one has devised an ELO system for it.
Similarly, a relatively dumb but more general AI, may require relatively large, preferably somewhat homogeneous data sets to come to conclusions that are even acceptable. Having successive generations of AI train on the same or similar data sets could provide a useful way of tracking progress/feedback mechanism for determining how successful various research efforts are.
The benefit of this research approach is that not only is it a relatively safe path towards a possible AGI, in the event that the speculative futures of mind-uploads and superintelligences do not take place, there's still substantial utility in having devised a system that is capable of making correct moral decisions in limited domains. I want my self-driving car to make a much larger effort to avoid a child in the road than a plastic bag. I'd be even happier if it could distinguish between an opossum and someone's cat.
When I design research projects, one of the things I try to ensure is that if some of my assumptions are wrong, the project fails gracefully. Obviously it's easy to love the Pascal's Wager-like impact statement of FAI, but if I were writing it up for an NSF grant I'd put substantially more emphasis on the importance of my research even if fully human level AI isn't invented for another 200 years. When I give the elevator pitch version of FAI, I've found placing a strong emphasis on the near future and referencing things people have encountered before such as computers playing jeopardy or self-driving cars makes them much more receptive to the idea of AI safety and allows me to discuss things like the potential for an unfriendly superintelligence without coming across as a crazed prophet of the end times.
I'm also just really really curious to see how well something like Watson would perform if I gave it a bunch of sociology data and asked if a human would rather find 5 dollars or stub a toe. There doesn't seem to be a huge categorical difference between the being able to answer the Daily Double and reasoning about human preferences, but I've been totally wrong about intuitive jumps that seemed much smaller than that one in the past, so it's hard to be too confident.
Cross-temporal dependency, value bounds and superintelligence
In this short post I will attempt to put forth some potential concerns that should be relevant when developing superintelligences, if certain meta-ethical effects exist. I do not claim they exist, only that it might be worth looking for them since their existence would mean some currently irrelevant concerns are, in fact, relevant.
These meta-ethical effects would be a certain kind of cross-temporal dependency on moral value. First, let me explain what I mean by cross-temporal dependency. If value is cross-temporal dependent it means that value at t2 could be affected by t1, independently of any causal role t1 has on t2. The same event X at t2 could have more or less moral value depending on whether Z or Y happened at t1. For instance, this could be the case on matters of survival. If we kill someone and replace her with a slightly more valuable person some would argue there was a loss rather than a gain of moral value; whereas if a new person with moral value equal to the difference of the previous two is created where there was none, most would consider an absolute gain. Furthermore, some might consider small, gradual and continual improvements are better than abrupt and big ones. For example, a person that forms an intention and a careful detailed plan to become better, and forceful self-wrought to be better could acquire more value than a person that simply happens to take a pill and instantly becomes a better person - even if they become that exact same person. This is not because effort is intrinsically valuable, but because of personal continuity. There are more intentions, deliberations and desires connecting the two time-slices of the person who changed through effort than there are connecting the two time-slices of the person who changed by taking a pill. Even though both persons become equally morally valuable in isolated terms, they do so from different paths that differently affects their final value.
More examples. You live now in t1. If suddenly in t2 you were replaced by an alien individual with the same amount of value as you would otherwise have in t2, then t2 may not have the exact same amount of value as it would otherwise have, simply by virtue of the fact that in t1 you were alive and the alien's previous time slice was not. 365 individuals with a 1 day life do not amount to the same value as a single individual living through 365 days. Slice history in 1 day periods, each day the universe contains one unique advanced civilization with the same overall total moral value, each civilization being completely alien and ineffable to another, each civilization only lives for one day, and then it would be gone forever. This universe does not seem to hold the same moral value as the one where only one of those civilizations flourishes for eternity. On all these examples the value of a period of time seems to be affected by the existence or not of certain events at other periods. They indicate that there is, at least, some cross-temporal dependency.
Now consider another type of effect, bounds on value. There could be a physical bound – transfinite or not - on the total amount of moral value that can be present per instant. For instance, if moral value rests mainly on sentient well-being, which can be categorized as a particular kind of computation, and there is a bound on the total amount of such computation which can be performed per instant, then there is a bound on the amount of value per instant. If, arguably, we are currently extremely far from such bound, and this bound will eventually be reached by a superintelligence (or any other structure), then the total moral value of the universe would be dominated by the value of this physical bound, given that regions where the physical bound wasn't reached would make negligible contributions. How much faster the bound can be reached, also how much more negligible pre-bound values are.
Finally, if there is a form of value cross-temporal dependence where preceding events leading to a superintelligence could alter the value of this physical bound, then we not only ought to make sure we safely construct a superintelligence, but also that we do so following the path that maximizes such bound. It might be the case that an overly abrupt superintelligence would decrease such bound, thus all future moral value would be diminished by the fact there was a huge discontinuity in the past in the events leading to this future. Even small decreases on such bound would have dramatic effects. Although I do not know of any plausible cross-temporal effect of such kind, it seems this question deserves at least a minimal amount of thought. Both cross-temporal dependency and bounds on value seem plausible (in fact I believe some form of them are true), so it is not at all prima facie inconceivable that we could have cross-temporal effects changing the bound up or down.
Value learning: ultra-sophisticated Cake or Death
Many mooted AI designs rely on "value loading", the update of the AI’s preference function according to evidence it receives. This allows the AI to learn "moral facts" by, for instance, interacting with people in conversation ("this human also thinks that death is bad and cakes are good – I'm starting to notice a pattern here"). The AI has an interim morality system, which it will seek to act on while updating its morality in whatever way it has been programmed to do.
But there is a problem with this system: the AI already has preferences. It is therefore motivated to update its morality system in a way compatible with its current preferences. If the AI is powerful (or potentially powerful) there are many ways it can do this. It could ask selective questions to get the results it wants (see this example). It could ask or refrain from asking about key issues. In extreme cases, it could break out to seize control of the system, threatening or imitating humans so it could give itself the answers it desired.
Avoiding this problem turned out to be tricky. The Cake or Death post demonstrated some of the requirements. If p(C(u)) denotes the probability that utility function u is correct, then the system would update properly if:
Expectation(p(C(u)) | a) = p(C(u)).
Put simply, this means that the AI cannot take any action that could predictably change its expectation of the correctness of u. This is an analogue of the conservation of expected evidence in classical Bayesian updating. If the AI was 50% convinced about u, then it could certainly ask a question that would resolve its doubts, and put p(C(u)) at 100% or 0%. But only as long as it didn't know which moral outcome was more likely.
That formulation gives too much weight to the default action, though. Inaction is also an action, so a more correct formulation would be that for all actions a and b,
Expectation(p(C(u)) | a) = Expectation(p(C(u)) | b).
How would this work in practice? Well, suppose an AI was uncertain between whether cake or death was the proper thing, but it knew that if it took action a:"Ask a human", the human would answer "cake", and it would then update its values to reflect that cake was valuable but death wasn't. However, the above condition means that if the AI instead chose the action b:"don't ask", exactly the same thing would happen.
In practice, this means that as soon as the AI knows that a human would answer "cake", it already knows it should value cake, without having to ask. So it will not be tempted to manipulate humans in any way.
Unfriendly Natural Intelligence
Related to: UFAI, Paperclip maximizer, Reason as memetic immune disorder
A discussion with Stefan (cheers, didn't get your email, please message me) during the European Community Weekend Berlin fleshed out an idea I had toyed around with for some time:
If a UFAI can wreak havoc by driving simple goals to extremes then so should driving human desires to extremes cause problems. And we should already see this.
Actually we do.
We know that just following our instincts on eating (sugar, fat) is unhealthy. We know that stimulating our pleasure centers more or less directly (drugs) is dangerous. We know that playing certain games can lead to comparable addiction. And the recognition of this has led to a large number of more or less fine-tuned anti-memes e.g. dieting, early drug prevention, helplines. These memes steering us away from such behaviors were selected for because they provided aggregate benefits to the (members of) social (sub) systems they are present in.
Many of these memes have become so self-evident we don't recognize them as such. Some are essential parts of highly complex social systems. What is the general pattern? Did we catch all the critical cases? Are the existing memes well-suited for the task?How are they related. Many are probably deeply woven into our culture and traditions.
Did we miss any anti-memes?
This last question really is at the core of this post. I think we lack some necessary memes keeping new exploitations of our desires in check. Some new ones result from our society a) having developed the capacity to exploit them and b) the scientific knowledge to know how to do this.
240 questions for your utility function
A game comparing intrinsic values can approximate a person's utility function.
Caring about possible people in far Worlds
This relates to my recent post on existence in many-worlds.
I care about possible people. My child, if I ever have one, is one of them, and it seems monstrous not to care about one's children. There are many distinct ways of being a possible person. 1)You can be causally connected to some actual people in the actual world in some histories of that world. 2)You can be a counterpart of an actual person on a distinct world without causal connections 3)You can be distinct from all actual individuals, and in a causally separate possible world. 4)You can be acausally connectable to actual people, but in distinct possible worlds.
Those 4 ways are not separate partitions without overlap, sometimes they overlap, and I don't believe they exhaust the scope of possible people. The most natural question to ask is "should we care equally about about all kinds of possible people". Some people are seriously studying this, and let us hope they give us accurate ways to navigate our complex universe. While we wait, some worries seem relevant:
1) The Multiverse is Sadistic Argument:
P1.1: If all possible people do their morally relevant thing (call it exist, if you will) and
P1.2: We cannot affect (causally or acausally) what is or not possible
C1.0: Then we cannot affect the morally relevant thing.
2) The Multiverse is Paralyzing (related)
P2.1: We have reason to care about X-Risk
P2.2: Worlds where X-Risk obtains are possible
P2.3: We have nearly as much reason to worry about possible non-actual1 worlds where X-risk obtains, as we have to actual worlds where it obtains.
P2.4: There are infinitely more worlds where X-risk obtains that are possible than there are actual1
C2.0: Infinitarian Paralysis
1Actual here means belonging to the same quantum branching history as you. If you think you have many quantum successors, all of them are actual, same for predecessors, and people who inhabit your Hubble volume.
3) Reality-Fluid Can't Be All That Is Left Argument
P3.1) If all possible people do their morally relevant thing
P3.2) The way in which we can affect what is possible is by giving some subsets of it more units of reality-fluid, or quantum measure
P3.3) In fact reality-fluid is a ratio, such as a percentage of successor worlds of kind A or kind B for a particular world W
P3.4) A possible World3 with 5% reality-fluid in relation to World1 is causally indistinguishable from itself with 5 times more reality-fluid 25% in relation to World2.
P3.5) The morally relevant thing, though by constitution qualitative, seems to be quantifiable, and what matters is it's absolute quantity, not any kind of ratio.
C3.1: From 3.2 and 3.3 -> We can actually affect only a quantity that is relative to our world, not an absolute quantity.
C3.2: From C3.1 and P 3.5 -> We can't affect the relevant thing.
C3.3: We ended up having to talk about reality fluid because decisions matter, and reality fluid is the thing that decision changes (from P3.4 we know it isn't causal structure). But if all that decision changes is some ratio between worlds, and what matters by P3.5 is not a ratio between worlds, we have absolutely no clue of what we are talking about when we talk about "the thing that matters" "what we should care about" and "reality fluid".
These arguments are here not as a perfectly logical and acceptable argument structure, but to at least induce nausea about talking about Reality-Fluid, Measure, Morally relevant things in many-worlds, Morally relevant people causally disconnected to us. Those are not things you can Taboo the word away and keep the substance around. The problem does not lie in the word 'Existence', or in the sentence 'X is morally relevant'. It seems to me that the service that that existence or reality used to play doesn't make sense anymore (if all possible worlds exist or if Mathematical Universe Hypothesis is correct). We attempted to keep it around as a criterial determinant for What Matters. Yet now all that is left is this weird ratio that just can't be what matters. Without a criterial determinant for mattering, we are left in a position that makes me think we should head back towards a causal approach to morality. But this is an opinion, not a conclusion.
Edit: This post is an argument against the conjunctive truth of two things, Many Worlds, and the way in which we think of What Matters. It seems that the most natural interpretation of it is that Many Worlds is true, and thus my argument is against our notion of What Matters. In fact my position lies more in the opposite side - our notion of What Matters is (strongly related to) What Matters, so Many Worlds are less likely.
Inferring Values from Imperfect Optimizers
One approach to constructing a Friendly artificial intelligence is to create a piece of software that looks at large amounts of evidence about humans, and attempts to infer their values. I've been doing some thinking about this problem, and I'm going to talk about some approaches and problems that have occurred to me.
In a naive approach, we might define the problem like this: take some unknown utility function, U, and plug it into a mathematically clean optimization process (like AIXI) O. Then, look at your data set and take the information about the inputs and outputs of humans, and find the simplest U that best explains human behavior.
Unfortunately, this won't work. The best possible match for U is one that models not just those elements of human utility we're interested in, but also all the details of our broken, contradictory optimization process. The U we derive through this process will optimize for confirmation bias, scope insensitivity, hindsight bias, the halo effect, our own limited intelligence and inefficient use of evidence, and just about everything else that's wrong with us. Not what we're looking for.
Okay, so let's try putting a bandaid on it - let's go back to our original problem setup. However, we'll take our original O, and use all of the science on cognitive biases at our disposal to handicap it. We'll limit its search space, saddle it with a laundry list of cognitive biases, cripple its ability to use evidence, and in general make it as human-like as we possibly can. We could even give it akrasia by implementing hyperbolic discounting of reward. Then we'll repeat the original process to produce U'.
If we plug U' into our AI, the result will be that it will optimize like a human who had suddenly been stripped of all the kinds of stupidity that we programmed into our modified O. This is good! Plugged into a solid CEV infrastructure, this might even be good enough to produce a future that's a nice place to live. However, it's not quite ideal. If we miss a cognitive bias, then it'll be incorporated into the learned utility functions, and we may never be rid of it. What would be nice would be if we could get the AI to learn about cognitive biases, exhaustively, and update in the future if it ever discovered a new one.
If we had enough time and money, we could do this the hard way: acquire a representative sample of the human population, and pay them to perform tasks with simple goals under tremendous surveillance, and have the AI derive the human optimization process from the actions taken towards a known goal. However, if we assume that the human optimization process can be defined as a function over the state of the human brain, we should not trust the completeness of any such process learned from less data than the entropy of the human brain, which is on the order of tens of petabytes of extremely high quality evidence. If we want to be confident in the completeness of our model, we may need more experimental evidence than it is really practical to accumulate. Which isn't to say that this approach is useless - if we can hit close enough to the mark, then the AI may be able to run more exhaustive experimentation later and refine its own understanding of human brains to be closer to the ideal.
But it'd really be nice if our AI could do unsupervised learning to figure out the details of human optimization. Then we could simply dump the internet into it, and let it grind away at the data and spit out a detailed, complete model of human decision-making, from which our utility function could be derived. Unfortunately, this does not seem to be a tractable problem. It's possible that some insight could be gleaned by examining outliers with normal intelligence, but deviant utility functions (I am thinking specifically of sociopaths), but it's unclear how much insight can be produced by these methods. If anyone has suggestions for a more efficient way of going about it, I'd love to hear it. As it stands, it might be possible to get enough information from this to supplement a supervised learning approach - the closer we get to a perfectly accurate model, the higher the probability of Things Going Well.
Anyways, that's where I am right now. I just thought I'd put up my thoughts and see if some fresh eyes see anything I've been missing.
Cheers,
Niger
Cake, or death!
Here we'll look at the famous cake or death problem teasered in the Value loading/learning post.
Imagine you have an agent that is uncertain about its values and designed to "learn" proper values. A formula for this process is that the agent must pick an action a equal to:
- argmaxa∈A Σw∈W p(w|e,a) Σu∈U u(w)p(C(u)|w)
Let's decompose this a little, shall we? A is the set of actions, so argmax of a in A simply means that we are looking for an action a that maximises the rest of the expression. W is the set of all possible worlds, and e is the evidence that the agent has seen before. Hence p(w|e,a) is the probability of existing in a particular world, given that the agent has seen evidence e and will do action a. This is summed over each possible world in W.
And what value do we sum over in each world? Σu∈U u(w)p(C(u)|w). Here U is the set of (normalised) utility functions the agent is considering. In value loading, we don't program the agent with the correct utility function from the beginning; instead we imbue it with some sort of learning algorithm (generally with feedback) so that it can deduce for itself the correct utility function. The expression p(C(u)|w) expresses the probability that the utility u is correct in the world w. For instance, it might cover statements "it's 99% certain that 'murder is bad' is the correct morality, given that I live in a world where every programmer I ask tells me that murder is bad".
The C term is the correctness of the utility function, given whatever system of value learning we're using (note that some moral realists would insist that we don't need a C, that p(u|w) makes sense directly, that we can deduce ought from is). All the subtlety of the value learning is encoded in the various p(C(u)|w): this determines how the agent learns moral values.
So the whole formula can be described as:
- For each possible world and each possible utility function, figure out the utility of that world. Weigh that by the probability that that utility is correct is that world, and by the probability of that world. Then choose the action that maximises the weighted sum of this across all utility functions and worlds.
Naive cake or death

No Value
I am still quite new to LW, so I apologize if this is something that has been discussed before (I did try and search).
I would't normally post such a thing, as I try not to make a habit of complaining my problems to others, but a solution to this would likely benefit other rationalists (at least that's the excuse I made to myself).
Essentially, I am currently in a psychological state in which I simply have no strong values. There is no state of the world that I can imagine the world being in that generates a strong emotional reaction. Ever. In fact, I rarely experience strong emotions at all. When I do, I savor them whether they're positive or negative. I do have some preferences; I would somewhat prefer the world to be some ways than others, but never strongly. I prefer to feel pleasure rather than pain; I prefer the world to be a good place than a bad one, but not by much. Even my desire to have values seems to be a mere preference in much the same way. I have nothing to protect.
Is there any good solution to this?
Why safe Oracle AI is easier than safe general AI, in a nutshell
Moderator: "In our televised forum, 'Moral problems of our time, as seen by dead people', we are proud and privileged to welcome two of the most important men of the twentieth century: Adolf Hitler and Mahatma Gandhi. So, gentleman, if you had a general autonomous superintelligence at your disposal, what would you want it to do?"
Hitler: "I'd want it to kill all the Jews... and humble France... and crush communism... and give a rebirth to the glory of all the beloved Germanic people... and cure our blond blue eyed (plus me) glorious Aryan nation of the corruption of lesser brown-eyed races (except for me)... and... and... and... and... and... and... and... and... and... and... and... and... and... and... and... and... and... and... and... and... and... and... and... and... and... and... and... and..."
Gandhi: "I'd want it to convince the British to grant Indian independence... and overturn the cast system... and cause people of different colours to value and respect one another... and grant self-sustaining livelihoods to all the poor and oppressed of this world... and purge violence from the heart of men... and reconcile religions... and... and... and... and... and... and... and... and... and... and... and... and... and... and... and... and... and... and... and... and... and... and... and... and... and... and... and... and..."
Moderator: "And if instead you had a superintelligent Oracle, what would you want it to do?"
Hitler and Gandhi together: "Stay inside the box and answer questions accurately".
On the fragility of values
Programming human values into an AI is often taken to be very hard because values are complex (no argument there) and fragile. I would agree that values are fragile in the construction; anything lost in the definition might doom us all. But once coded into a utility function, they are reasonably robust.
As a toy model, let's say the friendly utility function U has a hundred valuable components - friendship, love, autonomy, etc... - assumed to have positive numeric values. Then to ensure that we don't lose any of these, U is defined as the minimum of all those hundred components.
Now define V as U, except we forgot the autonomy term. This will result in a terrible world, without autonomy or independence, and there will be wailing and gnashing of teeth (or there would, except the AI won't let us do that). Values are indeed fragile in the definition.
= 783df68a0f980790206b9ea87794c5b6)
Subscribe to RSS Feed
= f037147d6e6c911a85753b9abdedda8d)