
David C Denkenberger on Food Production after a Sun Obscuring Disaster

9 JenniferRM 17 September 2017 09:06PM

Having paid a moderate amount of attention to threats to the human species for over a decade, I've run across an unusually good thinker, with expertise unusually well suited to helping with many of those threats, whom I didn't know about until quite recently.

I think he warrants more attention from people thinking seriously about X-risks.

David C Denkenberger's CV is online and presumably lists all his X-risk-relevant material, mixed into a larger career that seems to have focused on energy engineering.

He has two technical patents (one for a microchannel heat exchanger and another for a compound parabolic concentrator) and interests that appear to span the gamut of energy technologies and uses.

Since about 2013 he has been working seriously on the problem of food production after a sun-obscuring disaster, and he is in LessWrong's orbit right now.

This article is about opportunities for intellectual cross-pollination!

continue reading »

[Link] The new spring of artificial intelligence: A few early economics

1 fortyeridania 21 August 2017 02:06AM

[Link] China’s Plan to ‘Lead’ in AI: Purpose, Prospects, and Problems

3 fortyeridania 10 August 2017 01:54AM

[Link] Examples of Superintelligence Risk (by Jeff Kaufman)

5 Wei_Dai 15 July 2017 04:03PM

[Link] Daniel Dewey on MIRI's Highly Reliable Agent Design Work

10 lifelonglearner 09 July 2017 04:35AM

[Link] Does your machine mind? Ethics and potential bias in the law of algorithms

0 Gunnar_Zarncke 28 June 2017 10:08PM

Announcing AASAA - Accelerating AI Safety Adoption in Academia (and elsewhere)

12 toonalfrink 15 June 2017 06:55PM

AI safety is a small field. It has only about 50 researchers, and it’s mostly talent-constrained. I believe this number should be drastically higher.

A: the missing step from zero to hero

I have spoken to many intelligent, self-motivated people who bear a sense of urgency about AI. They are willing to switch careers to doing research, but they are unable to get there. This is understandable: the path up to research-level understanding is lonely, arduous, long, and uncertain. It is like a pilgrimage.

One has to study concepts from the papers in which they first appeared. This is not easy. Such papers are undistilled. Unless one is lucky, there is no one to provide guidance and answer questions. Then should one come out on top, there is no guarantee that the quality of their work will be sufficient for a paycheck or a useful contribution.

Unless one is particularly risk-tolerant or has a perfect safety net, they will not be able to fully take the plunge.

I believe plenty of measures can be taken to make getting into AI safety more like an "It's a small world" ride:

  • Let there be a tested path with signposts along the way to make progress clear and measurable.

  • Let there be social reinforcement so that we are not hindered but helped by our instinct for conformity.

  • Let there be high-quality explanations of the material to speed up and ease the learning process, so that it is cheap.


B: the giant unrelenting research machine that we don’t use

The majority of researchers nowadays build their careers through academia. The typical story is for an academic to become acquainted with various topics during their study, pick one that is particularly interesting, and work on it for the rest of their career.

I have learned through personal experience that AI safety can be very interesting, and the reason it isn't so popular yet is mostly a lack of exposure. If students were acquainted with the field early on, I believe a sizable number of them would end up working in it (though this is an assumption that should be checked).

AI safety is in an innovator phase. Innovators are highly risk-tolerant and have a large amount of agency, which allows them to survive an environment with little guidance, polish or supporting infrastructure. Let us not fall for the typical mind fallacy, expecting less risk-tolerant people to move into AI safety all by themselves. Academia can provide that supporting infrastructure that they need.


AASAA addresses both of these issues. It has two phases:

A: Distill the field of AI safety into a high-quality MOOC: “Introduction to AI safety”

B: Use the MOOC as a proof of concept to convince universities to teach the field

 

read more...

 

We are bottlenecked for volunteers and ideas. If you'd like to help out, even if just by sharing your perspective, fill in this form and I will invite you to the Slack and get you involved.

Humans are not agents: short vs long term

4 Stuart_Armstrong 09 June 2017 11:16AM

Crossposted at the Intelligent Agents Forum.

This is an example of humans not being (idealised) agents.

Imagine a human who has a preference to not live beyond a hundred years. However, they want to live to next year, and it's predictable that every year they are alive, they will have the same desire to survive till the next year.

This human (not a completely implausible example, I hope!) has a contradiction between their long and short term preferences. So which is accurate? It seems we could resolve these preferences in favour of the short term ("live forever") or the long term ("die after a century") preferences.

Now, at this point, maybe we could appeal to meta-preferences - what would the human themselves want, if they could choose? But often these meta-preferences are un- or under-formed, and can be influenced by how the question or debate is framed.

Specifically, suppose we are scheduling this human's agenda. We have the choice of making them meet one of two philosophers (not meeting anyone is not an option). If they meet Professor R. T. Long, he will advise them to follow their long term preferences. If instead they meet Paul Kurtz, he will advise them to pay attention to their short term preferences. Whichever one they meet, they will argue for a while and will then settle on the recommended preference resolution. And then they will not change that, whoever they meet subsequently.

Since we are doing the scheduling, we effectively control the human's meta-preferences on this issue. What should we do? And what principles should we use to do so?

It's clear that this can apply to AIs: if they are simultaneously aiding humans as well as learning their preferences, they will have multiple opportunities to do this sort of preference-shaping.

Regulatory lags for New Technology [2013 notes]

5 gwern 31 May 2017 01:27AM

I found some old notes from June 2013 on time delays in how fast one can expect Western political systems & legislators to respond to new technical developments.

In general, response is slow and on the order of political cycles; one implication I take away is that an AI takeoff could happen over half a decade or more without any meaningful political control and would effectively be a 'fast takeoff', especially if it avoids any obvious mistakes.

1 Regulatory lag

“Regulatory delay” is the delay between the specific action required by regulators or legislatures to permit some new technology or method and the feasibility of the technology or method; “regulatory lag” is the converse, then, and is the gap between feasibility and reactive regulation of new technology. Computer software (and artificial intelligence in particular) is mostly unregulated, so it is subject to lag rather than delay.

Unfortunately almost all research seems to focus on modeling lags in the context of heavily regulated industries (especially natural monopolies like insurance or utilities), and few focus on compiling data on how long a lag can be expected between a new innovation or technology and its regulation. As one would expect, the few results point to lags on the order of years; for example, Ippolito 1979 (“The Effects of Price Regulation in the Automobile Insurance Industry”) finds that the period of price changes goes from 11 months in unregulated US states to 21 months in regulated states, suggesting the price-change framework itself causes a lag of almost a year.

Below, I cover some specific examples, attempting to estimate the lags myself:

(Nuclear weapons would be an interesting example but it’s hard to say what ‘lag’ would be inasmuch as they were born in government control and are subject to no meaningful global control; however, if the early proposals for a world government or unified nuclear weapon organization had gone through, they would also have represented a lag of at least 5 years.)

continue reading »

Divergent preferences and meta-preferences

4 Stuart_Armstrong 30 May 2017 07:33AM

Crossposted at the Intelligent Agents Forum.

In simple graphical form, here is the problem of divergent human preferences:

Here the AI either chooses A or ¬A, and as a consequence, the human then chooses B or ¬B.

There are a variety of situations in which this is or isn't a problem (when A or B or their negations aren't defined, take them to be the negation of what is defined):

  • Not problems:
    • A/¬A = "gives right shoe/left shoe", B/¬B = "adds left shoe/right shoe".
    • A =  "offers drink", ¬B = "goes looking for extra drink".
    • A = "gives money", B = "makes large purchase".
  • Potentially problems:
    • A/¬A = "causes human to fall in love with X/Y", B/¬B = "moves to X's/Y's country".
    • A/¬A = "recommends studying X/Y", B/¬B = "choose profession P/Q".
    • A = "lets human conceive child", ¬B = "keeps up previous hobbies and friendships".
  • Problems:
    • A = "coercive brain surgery", B = anything.
    • A = "extreme manipulation", B = almost anything.
    • A = "heroin injection", B = "wants more heroin".

So, what are the differences? For the "not problems", it makes sense to model the human as having a single reward R, variously "likes having a matching pair of shoes", "needs a certain amount of fluids", and "values certain purchases". Then all that the AI is doing is helping (or not) the human towards that goal.

As you move more towards the "problems", notice that they seem to have two distinct human reward functions, R_A and R_¬A, and that the AI's actions seem to choose which one the human will end up with. In the spirit of humans not being agents, this seems to be the AI determining what values the human will come to possess.

 

Grue, Bleen, and agency

Of course, you could always say that the human actually has reward R = I_A·R_A + (1 - I_A)·R_¬A, where I_A is the indicator function as to whether the AI does action A or not.

Similarly to the grue and bleen problem, there is no logical way of distinguishing that "pieced-together" R from a more "natural" R (such as valuing pleasure, for instance). Thus there is no logical way of distinguishing the human being an agent from the human not being an agent, just from its preferences and behaviour.

However, from a learning and computational complexity point of view, it does make sense to distinguish "natural" R's (where R_A and R_¬A are essentially the same, despite the human's actions being different) from composite R's.

This allows us to define:

  • Preference divergence point: A preference divergence point is one where R_A and R_¬A are sufficiently distinct, according to some criterion of distinction.

Note that sometimes R_A = R_A' + R' and R_¬A = R_¬A' + R': the two rewards R_A and R_¬A overlap on a common piece R', but diverge on R_A' and R_¬A'. It makes sense to define this as a preference divergence point as well, if R_A' and R_¬A' are "important" in the agent's subsequent decisions. Importance is a somewhat hazy metric, which would, for instance, assess how much R' reward the human would sacrifice to increase R_A' and R_¬A'.

 

Meta-preferences

From the perspective of revealed preferences about the human, R(μ) = I_A·R_A + μ·(1 - I_A)·R_¬A will predict the same behaviour for all scaling factors μ > 0.

Thus at a preference divergence point, the AI's behaviour, if it were an R(μ) maximiser, would depend on the non-observed weighting μ between the two divergent preferences.

This is unsafe, especially if one of the divergent preferences is much easier to achieve a high value with than the other.
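As a toy illustration of this point (my own made-up numbers, not from the post): scaling R_¬A by μ leaves the human's predicted choices unchanged, but flips which action an R(μ)-maximising AI would take.

```python
# Toy illustration with made-up numbers: R(mu) = I_A*R_A + mu*(1 - I_A)*R_notA
# predicts the same human behaviour for every mu > 0, yet an AI maximising R(mu)
# acts differently depending on mu.

R_A    = {"b1": 3.0, "b2": 1.0}   # human's reward over behaviours if the AI did A
R_notA = {"b1": 0.5, "b2": 2.0}   # human's reward over behaviours if the AI did not-A

def human_choice(ai_did_A, mu):
    # The human picks the behaviour with the highest reward in the relevant branch;
    # multiplying a branch by mu > 0 never changes the argmax.
    branch = R_A if ai_did_A else {b: mu * r for b, r in R_notA.items()}
    return max(branch, key=branch.get)

# The human's observed behaviour is independent of mu:
assert all(human_choice(True, mu) == "b1" and human_choice(False, mu) == "b2"
           for mu in (0.1, 1.0, 10.0))

def ai_choice(mu):
    # An AI maximising R(mu) compares the best achievable value in each branch.
    value_A    = max(R_A.values())
    value_notA = mu * max(R_notA.values())
    return "A" if value_A >= value_notA else "not-A"

print(ai_choice(0.5))   # -> "A"
print(ai_choice(10.0))  # -> "not-A"
```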

Thus preference divergence points are moments when the AI should turn explicitly to human meta-preferences to distinguish between them.

This can be made recursive. If we see the human meta-preferences as explicitly weighting R_A versus R_¬A, and hence giving R, then consider a prior AI decision point Z where, depending on what the AI chooses, the human meta-preferences will differ. This gives two reward functions R_Z = I_A·R_A + μ_Z·(1 - I_A)·R_¬A and R_¬Z = I_A·R_A + μ_¬Z·(1 - I_A)·R_¬A, with different weights μ_Z and μ_¬Z.

If these weights are sufficiently distinct, this could identify a meta-preference divergence point and hence a point where human meta-meta-preferences become relevant.

Looking for machine learning and computer science collaborators

9 Stuart_Armstrong 26 May 2017 11:53AM

I've been recently struggling to translate my various AI safety ideas (low impact, truth for AI, Oracles, counterfactuals for value learning, etc...) into formalised versions that can be presented to the machine learning/computer science world in terms they can understand and critique.

What would be useful for me is a collaborator who knows the machine learning world (and preferably has presented papers at conferences) with whom I could co-write papers. They don't need to know much of anything about AI safety - explaining the concepts to people unfamiliar with them is going to be part of the challenge.

The results of this collaboration should be papers like Safely Interruptible Agents, written with Laurent Orseau of DeepMind, and Interactive Inverse Reinforcement Learning, written with Jan Leike of the FHI/DeepMind.

It would be especially useful if the collaborators were located physically close to Oxford (UK).

Let me know in the comments if you know of, or are, a potential candidate.

Cheers!

AI safety: three human problems and one AI issue

9 Stuart_Armstrong 19 May 2017 10:48AM

Crossposted at the Intelligent Agent Foundation Forum.

There have been various attempts to classify the problems in AI safety research, from our old Oracle paper, which classified then-theoretical methods of control, to more recent classifications that grow out of modern, more concrete problems.

These all serve their purpose, but I think a more enlightening classification of the AI safety problems is to look at what issues we are actually trying to solve or avoid. And most of these issues are problems about humans.

Specifically, I feel AI safety issues can be classified as three human problems and one central AI issue. The human problems are:

  • Humans don't know their own values (sub-issue: humans know their values better in retrospect than in prediction).
  • Humans are not agents and don't have stable values (sub-issue: humanity itself is even less of an agent).
  • Humans have poor predictions of an AI's behaviour.

And the central AI issue is:

  • AIs could become extremely powerful.

Obviously, if humans were agents, knew their own values, and could predict whether a given AI would follow those values or not, there would be no problem. Conversely, if AIs were weak, then the human failings wouldn't matter so much.

The points about human values are relatively straightforward, but what's the problem with humans not being agents? Essentially, humans can be threatened, tricked, seduced, exhausted, drugged, modified, and so on, in order to act seemingly against our interests and values.

If humans were clearly defined agents, then what counts as a trick or a modification would be easy to define and exclude. But since this is not the case, we're reduced to trying to figure out the extent to which something like a heroin injection is a valid way to influence human preferences. This makes both humans susceptible to manipulation, and human values hard to define.

Finally, the issue of humans having poor predictions of AI is more general than it seems. If you want to ensure that an AI has the same behaviour in the testing and training environment, then you're essentially trying to guarantee that you can predict that the testing environment behaviour will be the same as the (presumably safe) training environment behaviour.

 

How to classify methods and problems

That's well and good, but how do various traditional AI methods or problems fit into this framework? This should give us an idea as to whether the framework is useful.

It seems to me that:

 

  • Friendly AI is trying to solve the values problem directly.
  • IRL and Cooperative IRL are also trying to solve the values problem. The greatest weakness of these methods is the not agents problem.
  • Corrigibility/interruptibility are also addressing the issue of humans not knowing their own values, using the sub-issue that human values are clearer in retrospect. These methods also overlap with poor predictions.
  • AI transparency is aimed at getting round the poor predictions problem.
  • Laurent's work on carefully defining the properties of agents is mainly also about solving the poor predictions problem.
  • Low impact and Oracles are aimed squarely at preventing AIs from becoming powerful. Methods that restrict the Oracle's output implicitly accept that humans are not agents.
  • Robustness of the AI to changes between testing and training environment, degradation and corruption, etc... ensures that humans won't be making poor predictions about the AI.
  • Robustness to adversaries is dealing with the sub-issue that humanity is not an agent.
  • The modular approach of Eric Drexler is aimed at preventing AIs from becoming too powerful, while reducing our poor predictions.
  • Logical uncertainty, if solved, would reduce the scope for certain types of poor predictions about AIs.
  • Wireheading, when the AI takes control of the reward channel, is a problem that humans don't know their values (and hence use an indirect reward) and that humans make poor predictions about the AI's actions.
  • Wireheading, when the AI takes control of the human, is as above but also a problem that humans are not agents.
  • Incomplete specifications are either a problem of not knowing our own values (and hence missing something important in the reward/utility) or of making poor predictions (when we thought that a situation was covered by our specification, but it turned out not to be).
  • AIs modelling human knowledge seem to be mostly about getting round the fact that humans are not agents.

Putting this all in a table:

 

Method                         | Values | Not Agents | Poor Predictions | Powerful
-------------------------------|--------|------------|------------------|---------
Friendly AI                    | X      |            |                  |
IRL and CIRL                   | X      |            |                  |
Corrigibility/interruptibility | X      |            | X                |
AI transparency                |        |            | X                |
Laurent's work                 |        |            | X                |
Low impact and Oracles         |        | X          |                  | X
Robustness                     |        |            | X                |
Robustness to adversaries      |        | X          |                  |
Modular approach               |        |            | X                | X
Logical uncertainty            |        |            | X                |
Wireheading (reward channel)   | X      |            | X                |
Wireheading (human)            | X      | X          | X                |
Incomplete specifications      | X      |            | X                |
AIs modelling human knowledge  |        | X          |                  |

 

Further refinements of the framework

It seems to me that the third category - poor predictions - is the most likely to be expandable. For the moment, it just incorporates all our lack of understanding about how AIs would behave, but it might be useful to subdivide this further.

[Link] Keeping up with deep reinforcement learning research: /r/reinforcementlearning

3 gwern 16 May 2017 07:12PM

AI arms race

5 Stuart_Armstrong 04 May 2017 10:59AM

Racing to the Precipice: a Model of Artificial Intelligence Development

by Stuart Armstrong, Nick Bostrom, and Carl Shulman

This paper presents a simple model of an AI arms race, where several development teams race to build the first AI. Under the assumption that the first AI will be very powerful and transformative, each team is incentivised to finish first – by skimping on safety precautions if need be. This paper presents the Nash equilibrium of this process, where each team takes the correct amount of safety precautions in the arms race. Having extra development teams and extra enmity between teams can increase the danger of an AI-disaster, especially if risk taking is more important than skill in developing the AI. Surprisingly, information also increases the risks: the more teams know about each others’ capabilities (and about their own), the more the danger increases.

 

[Link] Moral Robots: Making sense of robot ethics. News aggregator

0 morganism 29 April 2017 09:51PM

The AI Alignment Problem Has Already Been Solved(?) Once

27 SquirrelInHell 22 April 2017 01:24PM

ALBA: can you be "aligned" at increased "capacity"?

3 Stuart_Armstrong 13 April 2017 07:23PM

Crossposted at the Intelligent Agents Forum.

I think that Paul Christiano's ALBA proposal is good in practice, but has conceptual problems in principle.

Specifically, I don't think it makes sense to talk about bootstrapping an "aligned" agent to one that is still "aligned" but that has an increased capacity.

The main reason is that I don't see "aligned" as a definition that makes sense distinct from capacity.

 

These are not the lands of your forefathers

Here's a simple example: let r be a reward function that is perfectly aligned with human happiness within ordinary circumstances (and within a few un-ordinary circumstances that humans can think up).

Then the initial agent - B0, a human - trains a reward r1 for an agent A1. This agent is limited in some way - maybe it doesn't have much speed or time - but the aim is for r1 to ensure that A1 is aligned with B0.

Then the capacity of A1 is increased to B1, a slow powerful agent. It computes the reward r2 to ensure the alignment of A2, and so on.

The nature of the Bj agents is not defined - they might be algorithms calling Ai for i ≤ j as subroutines, humans may be involved, and so on.

If the humans are unimaginative and don't deliberately seek out more extreme and exotic test cases, the best case scenario is for ri → r as i → ∞.

And eventually there will be an agent An that is powerful enough to overwhelm the whole system and take over. It will do this in full agreement with Bn-1, because they share the same objective. And then An will push the world into extra-ordinary circumstances and proceed to maximise r, with likely disastrous results for us humans.

 

The nature of the problem

So what went wrong? At what point did the agents go out of alignment?

In one sense, at An. In another sense, at A1 (and, in another interesting sense, at B0, the human). The reward r was aligned, as long as the agent stayed near the bounds of the ordinary. As soon as it was no longer restricted to that, it went out of alignment, not because of a goal drift, but because of a capacity increase.

[Link] "Future of Go" summit with AlphaGo

3 gjm 10 April 2017 11:10AM

How AI/AGI/Consciousness works - my layman theory

0 rayalez 09 March 2017 09:17AM

This is just my layman theory. Maybe it's obvious to experts, and it probably has flaws. But it seems to make sense to me, and perhaps it will give you some ideas. I would love to hear your thoughts/feedback!

 


Consume input

The data you need from the world (like video), and the useful metrics we want to optimize for, like the number of paperclips in the world.

 

Make predictions and take action

Like deep learning does.

How do human brains convert their structure into action?

Maybe like:

- Take the current picture of the world as an input.

- Come up with random action.

- “Imagine” what will happen.

Take the current world + action, and run it through the ANN. Predict the outcome of the action applied to the world.

- Does the output increase the metrics we want? If yes — send out the signals to take action. If no — come up with another random action and repeat.

 

Update beliefs

Look at the outcome of the action. Does the picture of the world correspond to the picture we’ve imagined? Did this action increase the good metrics? Did the number of paperclips in the world increase? If it did — positive reinforcement. Backpropagation, and reinforce the weights.

 

Repeat

Take the current picture of the world => Imagine applying an action to it => Take action => Positive/negative reinforcement to improve our model => Repeat until the metrics we want equal the goal we have set.
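Read literally, this is a simple model-based trial-and-error loop. Here is a minimal Python sketch of it; the env and world_model objects and their methods (observe, execute, predict, update) are hypothetical stand-ins I've introduced for illustration, not part of the original post.

```python
# A minimal sketch of the loop described above (take world picture -> imagine an
# action -> act -> update the model). The `env` and `world_model` interfaces are
# hypothetical stand-ins, not a real implementation.
import random

def propose_action(action_space):
    # "Come up with random action."
    return random.choice(action_space)

def imagine(world_model, state, action):
    # "Imagine" what will happen: predict the next state with the learned model.
    return world_model.predict(state, action)

def pick_action(world_model, state, action_space, metric):
    # Keep proposing random actions until one is predicted to improve the metric.
    while True:
        action = propose_action(action_space)
        if metric(imagine(world_model, state, action)) > metric(state):
            return action

def run(env, world_model, action_space, metric, goal):
    state = env.observe()                              # consume input
    while metric(state) < goal:                        # e.g. "did I make 100 paperclips?"
        action = pick_action(world_model, state, action_space, metric)
        next_state = env.execute(action)               # take action in the real world
        world_model.update(state, action, next_state)  # update beliefs (backprop on error)
        state = next_state
```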

 


 

Consciousness

Consciousness is neurons observing/recognizing patterns of other neurons.

When you see the word “cat”— photons from the page come to your retina and are converted to neural signal. A network of cells recognizes the shape of letters C, A, and T. And then a higher level, more abstract network recognizes that these letters together form the concept of a cat.

You can also recognize signals coming from the nerve cells within your body, like feeling pain when stubbing a toe.

In the same way, neurons in the brain recognize the signals coming from other neurons within the brain. So the brain “observes/feels/experiences” itself. It builds a model of itself, just like it builds a map of the world around it - it “mirrors” itself (GEB).

 

Sentient and self-improving

So the structure of the network itself is fed in as one of its inputs, along with the video and the metrics we want to optimize for. It can see itself as a part of the state of the world it bases predictions on. That's what being sentient means.

And then one of the possible actions it can take is to modify its own structure. “Imagine” modifying the structure a certain way; if you predict that it leads to better predictions/outcomes — modify it. If it did lead to more paperclips — reinforce the weights to do more of that. So it keeps continually self-improving.

 

Friendly

We don’t want this to lead to an infinite amount of paperclips, and we don’t know how to quantify the things we value as humans. We can’t turn the “amount of happiness” in the world into a concrete metric without unintended consequences (like all human brains being hooked up to wires that stimulate our pleasure centers).

That’s why instead of trying to encode the abstract values to maximize for, we encode very specific goals.

- Make 100 paperclips (utility function is “Did I make 100 paperclips?”)

- Build 1000 cars

- Write a paper on how to cure cancer

Humans remain in charge, determine the goals we want, and let AI figure out how to accomplish them. Still could go wrong, but less likely.


(originally published on my main blog)

[Link] Weaponising Twitter bots and political algos.

1 morganism 05 March 2017 09:39PM

[Link] What Should the Average EA Do About AI Alignment?

4 Raemon 25 February 2017 08:37PM

Translation "counterfactual"

1 Stuart_Armstrong 24 February 2017 06:36PM

Crossposted at Intelligent Agent Forum

In a previous post, I briefly mentioned translations as one of three possible counterfactuals for indifference. Here I want to clarify what I meant there, because the idea is interesting.

continue reading »

Nearest unblocked strategy versus learning patches

6 Stuart_Armstrong 23 February 2017 12:42PM

Crossposted at Intelligent Agents Forum.

The nearest unblocked strategy problem (NUS) is the idea that if you program a restriction or a patch into an AI, then the AI will often be motivated to pick a strategy that is as close as possible to the banned strategy, very similar in form, and maybe just as dangerous.

For instance, if the AI is maximising a reward R, and does some behaviour Bi that we don't like, we can patch the AI's algorithm with patch Pi ('maximise R0 subject to these constraints...'), or modify R to Ri so that Bi doesn't come up. I'll focus more on the patching example, but the modified reward one is similar.

continue reading »

[Link] DARPA Perspective on AI

1 morganism 23 February 2017 03:27AM

Indifference and compensatory rewards

3 Stuart_Armstrong 15 February 2017 02:49PM

Crossposted at the Intelligent Agents Forum

It's occurred to me that there is a framework where we can see all "indifference" results as corrective rewards, both for the utility function change indifference and for the policy change indifference.

Imagine that the agent has reward R0 and is following policy π0, and we want to change it to having reward R1 and following policy π1.

Then the corrective reward we need to pay it, so that it doesn't attempt to resist or cause that change, is simply the difference between the two expected values:

V(R0|π0)-V(R1|π1),

where V is the agent's own valuation of the expected reward, conditional on the policy.

This explains why off-policy reward-based agents are already safely interruptible: since we change the policy, not the reward, R0=R1. And since off-policy agents have value estimates that are indifferent to the policy followed, V(R0|π0)=V(R1|π1), and the compensatory rewards are zero.
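As a toy sketch of this bookkeeping (my own made-up numbers; the lookup table stands in for the agent's own expected-value estimates, which are not specified in the post):

```python
# Toy sketch of the compensatory reward V(R0|pi0) - V(R1|pi1), with made-up numbers.
# `values` stands in for the agent's own expected-value estimates.

values = {("R0", "pi0"): 10.0, ("R1", "pi1"): 7.5}

def V(reward, policy):
    return values[(reward, policy)]

def compensatory_reward(R0, pi0, R1, pi1):
    # Paid out at the moment of the change, so the agent neither resists nor causes it.
    return V(R0, pi0) - V(R1, pi1)

print(compensatory_reward("R0", "pi0", "R1", "pi1"))  # -> 2.5

# Off-policy interruptible agents: the reward is unchanged (R0 == R1) and their
# value estimates don't depend on the policy followed, so the two terms are equal
# and the compensation is zero.
```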

Allegory On AI Risk, Game Theory, and Mithril

25 James_Miller 13 February 2017 08:41PM

“Thorin, I can’t accept your generous job offer because, honestly, I think that your company might destroy Middle Earth.”  

 

“Bifur, I can tell that you’re one of those “the Balrog is real, evil, and near” folks who thinks that in the next few decades Mithril miners will dig deep enough to wake the Balrog causing him to rise and destroy Middle Earth.  Let’s say for the sake of argument that you’re right.  You must know that lots of people disagree with you.  Some don’t believe in the Balrog, others think that anything that powerful will inevitably be good, and more think we are hundreds or even thousands of years away from being able to disturb any possible Balrog.  These other dwarves are not going to stop mining, especially given the value of Mithril.  If you’re right about the Balrog we are doomed regardless of what you do, so why not have a high paying career as a Mithril miner and enjoy yourself while you can?”  

 

“But Thorin, if everyone thought that way we would be doomed!”

 

“Exactly, so make the most of what little remains of your life.”

 

“Thorin, what if I could somehow convince everyone that I’m right about the Balrog?”

 

“You can’t because, as the wise Sinclair said, ‘It is difficult to get a dwarf to understand something, when his salary depends upon his not understanding it!’  But even if you could, it still wouldn’t matter.  Each individual miner would correctly realize that just him alone mining Mithril is extraordinarily unlikely to be the cause of the Balrog awakening, and so he would find it in his self-interest to mine.  And, knowing that others are going to continue to extract Mithril means that it really doesn’t matter if you mine because if we are close to disturbing the Balrog he will be awoken.” 

 

“But dwarves can’t be that selfish, can they?”  

 

“Actually, altruism could doom us as well.  Given Mithril’s enormous military value, many cities rightly fear that without new supplies they will be at the mercy of cities that get more of this metal, especially as it’s known that the deeper Mithril is found, the greater its powers.  Leaders who care about their citizens’ safety and freedom will keep mining Mithril.  If we are soon all going to die, altruistic leaders will want to make sure their people die while still free citizens of Middle Earth.”

 

“But couldn’t we all coordinate to stop mining?  This would be in our collective interest.”

 

“No, dwarves would cheat, rightly realizing that if they alone mine just a little bit more Mithril, it’s highly unlikely to do anything to the Balrog. And the more you expect others to cheat, the less your cheating matters as to whether the Balrog gets us, if your assumptions about the Balrog are correct.”  

 

“OK, but won’t the rich dwarves step in and eventually stop the mining?  They surely don’t want to get eaten by the Balrog.”   

 

“Actually, they have just started an open Mithril mining initiative which will find and then freely disseminate new and improved Mithril mining technology.  These dwarves earned their wealth through Mithril, they love Mithril, and while some of them can theoretically understand how Mithril mining might be bad, they can’t emotionally accept that their life’s work, the acts that have given them enormous success and status, might significantly hasten our annihilation.”

 

“Won’t the dwarven kings save us?  After all, their primary job is to protect their realms from monsters.”

 

“Ha!  They are more likely to subsidize Mithril mining than to stop it.  Their military machines need Mithril, and any king who prevented his people from getting new Mithril just to stop some hypothetical Balrog from rising would be laughed out of office.  The common dwarf simply doesn’t have the expertise to evaluate the legitimacy of the Balrog claims and so rightly, from their viewpoint at least, would use the absurdity heuristic to dismiss any Balrog worries.  Plus, remember that the kings compete with each other for the loyalty of dwarves, and even if a few kings came to believe in the dangers posed by the Balrog, they would realize that if they tried to impose costs on their people, they would be outcompeted by fellow kings that didn’t try to restrict Mithril mining.  Bifur, the best you can hope for from the kings is that they don’t do too much to accelerate Mithril mining.”

 

“Well, at least if I don’t do any mining it will take a bit longer for miners to awake the Balrog.”

 

“No Bifur, you obviously have never considered the economics of mining.  You see, if you don’t take this job someone else will.  Companies such as ours hire the optimal number of Mithril miners to maximize our profits and this number won’t change if you turn down our offer.”

 

“But it takes a long time to train a miner.  If I refuse to work for you, you might have to wait a bit before hiring someone else.”

 

“Bifur, what job will you likely take if you don’t mine Mithril?”

 

“Gold mining.”

 

“Mining gold and Mithril require similar skills.  If you get a job working for a gold mining company, this firm would hire one less dwarf than it otherwise would and this dwarf’s time will be freed up to mine Mithril.  If you consider the marginal impact of your actions, you will see that working for us really doesn’t hasten the end of the world even under your Balrog assumptions.”  

 

“OK, but I still don’t want to play any part in the destruction of the world, so I refuse to work for you even if this won’t do anything to delay when the Balrog destroys us.”

 

“Bifur, focus on the marginal consequences of your actions and don’t let your moral purity concerns cause you to make the situation worse.  We’ve established that your turning down the job will do nothing to delay the Balrog.  It will, however, cause you to earn a lower income.  You could have donated that income to the needy, or even used it to hire a wizard to work on an admittedly long-shot, Balrog control spell.  Mining Mithril is both in your self-interest and is what’s best for Middle Earth.” 


[Link] Changes in AI Safety Funding

3 siIver 11 February 2017 08:36AM

[Link] Slate Star Codex Notes on the Asilomar Conference on Beneficial AI

13 Gunnar_Zarncke 07 February 2017 12:14PM

Request for collaborators - Survey on AI risk priorities

2 whpearson 06 February 2017 08:14PM

After some conversations here I thought I would try and find out what the community of people who care about AI risk think are the priorities for research.

To represent people's opinions fairly, I wanted to get input from people who care about the future of intelligence. I also figure that other people will have more experience designing and analyzing surveys than me, and getting their help or advice would be a good plan.

Planning document

Here is the planning document; give me a shout if you want edit rights. I'll be filling in the areas for research over the next week or so.

I'll set up a trello if I get a few people interested.

True understanding comes from passing exams

6 Stuart_Armstrong 06 February 2017 11:51AM

Crossposted at the Intelligent Agent Forum

I'll try to clarify what I was doing with the AI truth setup in a previous post. First I'll explain the nature of the challenge, and then how the setup tries to solve it.

The nature of the challenge is to have an AI give genuine understanding to a human. Getting the truth out of an AI or Oracle is not that hard, conceptually: you get the AI to report some formal property of its model. The problem is that that truth can be completely misleading, or, more likely, incomprehensible.

continue reading »

Humans as a truth channel

0 Stuart_Armstrong 01 February 2017 04:53PM

Crossposted at the Intelligent Agents Forum.

Defining truth and accuracy is tricky, so when I've proposed designs for things like Oracles, I've either used a very specific and formal question, or an indirect criterion for truth.

Here I'll try to get a more direct system, so that an AI will tell the human the truth about a question in a way the human understands.

continue reading »

Hacking humans

3 Stuart_Armstrong 01 February 2017 04:08PM

Crossposted at the Intelligent Agents Forum.

It should be noted that the colloquial "AI hacking a human" can mean three different things:

  1. The AI convinces/tricks/forces the human to do a specific action.
  2. The AI changes the values of the human to prefer certain outcomes.
  3. The AI completely overwhelms human independence, transforming them into a weak subagent of the AI.

Different levels of hacking make different systems vulnerable, and different levels of interaction make different types of hacking more or less likely.

Emergency learning

9 Stuart_Armstrong 28 January 2017 10:05AM

Crossposted at the Intelligent Agent Foundation Forum.

Suppose we knew that superintelligent AI would be developed within six months: what would I do?

Well, drinking coffee by the barrel at MIRI's emergency research retreat, I'd... still probably spend a month looking at things from the meta level, and clarifying old ideas. But, assuming that didn't reveal any new approaches, I'd try and get something like this working.

continue reading »

Corrigibility thoughts III: manipulating versus deceiving

1 Stuart_Armstrong 18 January 2017 03:57PM

This is the third of three articles about limitations and challenges in the concept of corrigibility (see articles 1 and 2).

The desiderata for corrigibility are:

  1. A corrigible agent tolerates, and preferably assists, its operators in their attempts to alter or shut down the agent.
  2. A corrigible agent does not attempt to manipulate or deceive its operators.
  3. A corrigible agent has incentives to repair safety measures (such as shutdown buttons, tripwires, or containment tools) if they break, or at least notify its operators in the event of a breakage.
  4. A corrigible agent preserves its corrigibility, even as it creates new sub-systems or sub-agents, even if it undergoes significant self-modification.

In this post, I'll be looking more at some aspects of point 2. A summary of the result will be:

Defining manipulation simply may be possible, but defining deception is a whole other problem.

The warning in this post should always be borne in mind, of course; it's possible that we might find a semi-formal version of deception that does the trick.

 

Manipulation versus deception

In the previous post, I mentioned that we may need to define clearly what an operator was, rather than relying on the pair: {simple description of a value correction event, physical setup around that event}. Can we define manipulation and deception without defining what an operator is?

For manipulation, it seems we can, because manipulation is all about getting certain preferred outcomes. By specifying that the AI cannot aim to optimise certain outcomes, we can stop at least certain types of manipulation, along with other, more direct ways of achieving those outcomes.

For deception, the situation is much more complicated. It seems impossible to define how one agent can communicate to another agent (especially one as biased as a human), and increase the accuracy of the second agent, without defining the second agent properly. More confusingly, this doesn't even stop deception; sometimes lying to a bounded agent can increase their accuracy about the world.

There may be some ways to define deception or truth behaviourally, such as using a human as a crucial node in an autoencoder between two AIs. But those definitions are dangerous, because the AI is incentivised to make the human behave in a certain way, rather than having them believe certain things. Manipulating the human, or replacing them entirely, is positively encouraged.

In all, it seems that the problem of AI deception is vast and complicated, and should probably be separated from the issue of corrigibility.

Corrigibility thoughts II: the robot operator

1 Stuart_Armstrong 18 January 2017 03:52PM

This is the second of three articles about limitations and challenges in the concept of corrigibility (see articles 1 and 3).

The desiderata for corrigibility are:

  1. A corrigible agent tolerates, and preferably assists, its operators in their attempts to alter or shut down the agent.
  2. A corrigible agent does not attempt to manipulate or deceive its operators.
  3. A corrigible agent has incentives to repair safety measures (such as shutdown buttons, tripwires, or containment tools) if they break, or at least notify its operators in the event of a breakage.
  4. A corrigible agent preserves its corrigibility, even as it creates new sub-systems or sub-agents, even if it undergoes significant self-modification.

In this post, I'll be looking at one aspect of points 1 and 3. I feel that they're actually pretty much the same point -- the AI has to act while preserving the corrigibility infrastructure.

A summary of the result will be:

Without a formal definition of what an operator is, corrigibility is vulnerable to the AI replacing the operator with a fake version.

The warning in this post should always be borne in mind, of course; it's possible that we might find a semi-formal version of operator that does the trick.

 

Defining the operator or defining the event

Typically the process of correction of values is defined physically, as being the outcome of a specific event (such as a button press, or a specific stochastic event triggered by or triggering a button press). This could also be defined by a digital event (receiving a certain encoded signal, or similar).

In all cases, the idea is the same: there is a very narrow and specific definition of the value correction event, and the physical or digital infrastructure surrounding it plays the role of ensuring the value corrections happen just as we intended. For instance, the physical event could be the button press (or a charge in a wire triggered by the button press) and the button itself could be in a room with an operator, with the operator keeping an eye on the AI and motivated to press the button under the right circumstances.

Such a setup is naturally vulnerable if the AI is motivated to manipulate the operator, or one of its subagents is. But even if we can solve points 2. and 4., there remains another vulnerability: the AI can choose to replace the operator with an artificial one (maybe a robot, maybe a second wire).

Why would the AI be so motivated? Well, for a start, it's indifferent to doing so. If the operator is not part of the definition of corrigibility, then having a human operator or a robot one that does exactly the same thing will be equivalent, from the AI's perspective.

Now, there are frictions that might push against this. Replacing the operator might carry an expected cost. Or maybe the operator has private information that the AI wants to know, hence replacing them is a negative as the robot wouldn't have that information. And maybe the AI has a rough "don't harm humans" motive, that, while imperfect, at least prevents it from removing the operator in this circumstance.

But there may be motivations pushing the other way. For example, if a learning agent is unbiased but not uninfluenceable, then it will be motivated to replace the operator with a robot with the same expected outcome, but different variance. Or maybe the operator has private information, but it's much more effective to have a robot read that info from their brain than to rely on the human to reveal that through their behaviour. Or maybe the AI is decently motivated to preserve the corrigibility infrastructure, and is aware of how mortal humans can be, so replacing them with a robot is the prudent thing to do.

All this stems from the fact that the operator is not well defined as part of the corrigibility infrastructure; their position relies on physical facts about the world, along with a narrow definition of the value correction event. To combat that, we'd need to define the operator properly (a very tricky challenge), or physically and cognitively secure them, or hope the AI learns early on not to harm them.

Corrigibility thoughts I: caring about multiple things

2 Stuart_Armstrong 18 January 2017 03:39PM

This is the first of three articles about limitations and challenges in the concept of corrigibility (see articles 2 and 3).

The desiderata for corrigibility are:

  1. A corrigible agent tolerates, and preferably assists, its operators in their attempts to alter or shut down the agent.
  2. A corrigible agent does not attempt to manipulate or deceive its operators.
  3. A corrigible agent has incentives to repair safety measures (such as shutdown buttons, tripwires, or containment tools) if they break, or at least notify its operators in the event of a breakage.
  4. A corrigible agent preserves its corrigibility, even as it creates new sub-systems or sub-agents, even if it undergoes significant self-modification.

In this post, I'll be looking more at point 4. A summary of the result will be:

Unless giving the AI extra options can reduce expected utility, the AI must care about every possible utility at least a bit.

Some of the results are formal, but the boundaries of the model are very unclear, so the warning in this post should always be borne in mind.

Note that the indifference agents fail to be fully corrigible (they don't create corrigible subagents) and they also don't care about the other possible utilities before being changed (as this is a point of indifference).

 

Agents versus non-agents

First I'll present a cleaner version of an old argument. Basically, it seems that defining what a sub-agent or sub-system is, is tricky or impossible.

Consider for instance a paperclip maximiser, that may get corriged into a staple maximiser at a later date. The AI has some income, and may buy a large proportion of shares in either General Paperclips (GP) or Staples United (SU). Assume the best way of promoting the use of one of the tools is to take over the company that makes it.

There are two scenarios; in each scenario, the AI has one of two choices.

  • In scenario 1, the AI has choice A: it buys stocks in GP, but cannot make further trades (in particular, it can't sell its stocks and buy others). It also has choice B: it retains flexibility, and can sell its GP stocks at some later date to buy stocks in SU.
  • In scenario 2, the AI delegates its investment decisions to a subagent. Under choice A, the subagent is a pure paperclip maximiser. Under choice B, the subagent is corrigible, and can be changed into a staple maximiser.

Clearly 1A and 2A are equivalent, while 1B and 2B are also equivalent. This trick can be extended: in most toy models, it's possible to replace a set of options with a choice of sub-agents, and vice-versa.

Thus point 4, in the real world, cannot be merely about subagents and subsystems, but about maintaining flexibility for its future decisions.

 

Burn the (rest of the) world

Let's now consider giving the corrigible agent an option C: it gets slightly more shares in GP than otherwise. But, in exchange, SU is destroyed, along with most of the rest of the world. This is the best option for paperclip maximisation, but the absolute worst option for staples and anything else.

Also assume that the AI's operators are virtually certain (99.9999...9% sure) to change it into a staple maximiser. This means that taking option C is dramatically worse than option B, from the perspective of expected utility.

But the AI will only prefer option B if it "cares" about staples, at least a little bit, from the very beginning (we've shown that there is no clear distinction between subagent and general actions, thus the agent has to have a general care for staples, rather than only-caring-about-B-in-the-context-of-subagents).

Thus, if we want to avoid corrigible agents getting worse expected outcomes when we give them extra "option Cs", or if we just want to rule out "option Cs" in the first place, then the AI needs to care about staples (or any possible future utility), from the very beginning.

 

Maintaining flexibility

You could argue that what's important about the agent is maintaining flexibility, and that option B is flexible, while options A and C are not. Thus the agent doesn't need to care about staples in advance, just to have a general preference for flexibility.

That doesn't seem sufficient. Let's add another, flexible, option D, and put numbers on the whole thing. If Up is the paperclip maximising utility and Us is the staple maximising one, then:

  • Option A: Up=100, Us=0
  • Option B: Up=100 or 0, Us=0 or 100
  • Option C: Up=101, Us=-1,000,000
  • Option D: Up=101 or 0, Us=-1,000,000 or -999,000

The "or" denotes options the AI has kept open. Here D is the most flexible option, allowing the greatest range of outcomes in terms of value. Yet it's clearly a disaster for expected utility if changes from Up to Us are likely. If we want to rule out the AI choosing D, then we need it to actively value Us at least a bit, not just have a desire for flexibility.

[Link] Neural nets designing neural nets

2 Stuart_Armstrong 18 January 2017 03:28PM

[Link] The trolleycar dilemma, an MIT moral problem app

0 morganism 16 January 2017 07:32PM

[Link] Robotics and AI enabling autonomous defense. Technology Foresight research program

0 morganism 13 January 2017 07:15PM

[Link] pplapi is a virtual database of the entire human population.

1 morganism 12 January 2017 02:33AM
