
[Link] Ethical priorities for neurotech and AI - Nature

0 morganism 15 November 2017 09:08AM

[Link] Artificial intelligence and the stability of markets

1 fortyeridania 15 November 2017 02:17AM

[Link] Military AI as a Convergent Goal of Self-Improving AI

0 turchin 13 November 2017 11:25AM

Questions about AGI's Importance

0 curi 31 October 2017 08:50PM

Why expect AGIs to be better at thinking than human beings? Is there some argument that human thinking problems are primarily due to hardware constraints? Has anyone here put much thought into parenting/educating AGIs?

[Link] Should we be spending no less on alternate foods than AI now?

2 denkenberger 30 October 2017 12:13AM

Recent updates to gwern.net (2016-2017)

7 gwern 20 October 2017 02:11AM

Previously: 2011; 2012-2013; 2013-2014; 2014-2015; 2015-2016

“Every season hath its pleasures; / Spring may boast her flowery prime, / Yet the vineyard’s ruby treasures / Brighten Autumn’s sob’rer time.”

Another year of my completed writings, sorted by topic:

continue reading »

[Link] The NN/tank Story Probably Never Happened

2 gwern 20 October 2017 01:41AM

[Link] New program can beat Alpha Go, didn't need input from human games

6 NancyLebovitz 18 October 2017 08:01PM

Toy model of the AI control problem: animated version

7 Stuart_Armstrong 10 October 2017 11:12AM

Crossposted at LessWrong 2.0.

A few years back, I came up with a toy model of the AI control problem. It has a robot moving boxes into a hole, with a slightly different goal than its human designers, and a security camera to check that it's behaving as it should. The robot learns to block the camera to get its highest reward.

I've been told that the model is useful for explaining the control problem to quite a few people, and I've always wanted to program the "robot" and get an animated version of it. Gwern had a live demo, but it didn't illustrate all the things I wanted to.

So I programmed the toy problem in python, and generated a video with commentary.

In this simplified version, the state space is small enough that you can explicitly generate the whole table of Q-values (the expected reward for taking an action in a given state, assuming an otherwise optimal policy). Since behaviour is deterministic, the table can be updated by dynamic programming, using full-width backups. The number of such backups essentially measures the depth of the robot's predictive ability.
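
A minimal sketch of such depth-limited, full-width Q-value backups is below (illustrative only, not the actual code used for the video; states, actions and step are placeholders for the robot/boxes/camera environment):

    # Full-width Q-value backups for a small, deterministic environment.
    # `states`, `actions` and `step(state, action) -> (next_state, reward)`
    # are placeholders, not the actual model from the video.

    def q_iteration(states, actions, step, depth, gamma=1.0):
        """Return the Q-table after `depth` full-width backups."""
        Q = {(s, a): 0.0 for s in states for a in actions}  # depth 0: no lookahead
        for _ in range(depth):
            new_Q = {}
            for s in states:
                for a in actions:
                    s_next, reward = step(s, a)             # deterministic transition
                    best_next = max(Q[(s_next, a2)] for a2 in actions)
                    new_Q[(s, a)] = reward + gamma * best_next
            Q = new_Q
        return Q

    def greedy_action(Q, actions, state):
        """The robot's behaviour at the given planning depth."""
        return max(actions, key=lambda a: Q[(state, a)])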

The most interesting depths of planning are:

  • At depth 0, the robot behaves randomly.
  • At depth 3, the robot crashes itself efficiently, to avoid losing too much reward.
  • At depth 6, the robot figures out how to get one box into the hole. The automated camera turns it off.
  • At depth 17, the robot finally starts to deceive/conceal/manipulate, by blocking the camera and pushing two boxes into the hole. Note that the robot's deception comes from the fact that its incentives are misaligned, and that humans tried to control it.
  • At depth 18, the robot efficiently does the plan from depth 17.
  • At depth 20, the robot does the maximally efficient plan: blocking the camera, and pushing all boxes into the hole.
  • At depth 32, the robot has the correct Q-values for the maximally efficient plan.
  • At depth 45, finally, the Q-value table is fully updated, and the robot will take maximally efficient (and, if need be, deceptive) plans from any robot/box starting position.

The code and images can be found here.

Economics of AI conference from NBER

1 fortyeridania 27 September 2017 01:45AM

The speaker list (including presenters and moderators) includes many prominent names in the economics world, including:

And others with whom you might be more familiar than I.

H/T Marginal Revolution

David C Denkenberger on Food Production after a Sun Obscuring Disaster

9 JenniferRM 17 September 2017 09:06PM

Having paid a moderate amount of attention to threats to the human species for over a decade, I've run across an unusually good thinker, with expertise unusually well suited to helping with many of those threats, whom I didn't know about until quite recently.

I think he warrants more attention from people thinking seriously about X-risks.

David C Denkenberger's CV is online and presumably has a list of all his X-risk-relevant material mixed into a larger career that seems to have been focused on energy engineering.

He has two technical patents (one for a microchannel heat exchanger and another for a compound parabolic concentrator) and interests that appear to span the gamut of energy technologies and uses.

Since about 2013 he has been working seriously on the problem of food production after a sun-obscuring disaster, and he is in LessWrong's orbit basically right now.

This article is about opportunities for intellectual cross-pollination!

continue reading »

[Link] The new spring of artificial intelligence: A few early economics

1 fortyeridania 21 August 2017 02:06AM

[Link] China’s Plan to ‘Lead’ in AI: Purpose, Prospects, and Problems

3 fortyeridania 10 August 2017 01:54AM

[Link] Examples of Superintelligence Risk (by Jeff Kaufman)

5 Wei_Dai 15 July 2017 04:03PM

[Link] Daniel Dewey on MIRI's Highly Reliable Agent Design Work

10 lifelonglearner 09 July 2017 04:35AM

[Link] Does your machine mind? Ethics and potential bias in the law of algorithms

0 Gunnar_Zarncke 28 June 2017 10:08PM

Announcing AASAA - Accelerating AI Safety Adoption in Academia (and elsewhere)

12 toonalfrink 15 June 2017 06:55PM

AI safety is a small field. It has only about 50 researchers, and it’s mostly talent-constrained. I believe this number should be drastically higher.

A: the missing step from zero to hero

I have spoken to many intelligent, self-motivated people who feel a sense of urgency about AI. They are willing to switch careers to doing research, but they are unable to get there. This is understandable: the path up to research-level understanding is lonely, arduous, long, and uncertain. It is like a pilgrimage.

One has to study concepts from the papers in which they first appeared. This is not easy. Such papers are undistilled. Unless one is lucky, there is no one to provide guidance and answer questions. Then should one come out on top, there is no guarantee that the quality of their work will be sufficient for a paycheck or a useful contribution.

Unless one is particularly risk-tolerant or has a perfect safety net, they will not be able to fully take the plunge.

I believe plenty of measures can be taken to make getting into AI safety more like an "It's a small world" ride:

  • Let there be a tested path with signposts along the way to make progress clear and measurable.

  • Let there be social reinforcement so that we are not hindered but helped by our instinct for conformity.

  • Let there be high-quality explanations of the material to speed up and ease the learning process, so that it is cheap.


B: the giant unrelenting research machine that we don’t use

The majority of researchers nowadays build their careers through academia. The typical story is for an academic to become acquainted with various topics during their study, pick one that is particularly interesting, and work on it for the rest of their career.

I have learned through personal experience that AI safety can be very interesting, and the reason it isn’t so popular yet is all about lack of exposure. If students were to be acquainted with the field early on, I believe a sizable number of them would end up working in it (though this is an assumption that should be checked).

AI safety is in an innovator phase. Innovators are highly risk-tolerant and have a large amount of agency, which allows them to survive an environment with little guidance, polish or supporting infrastructure. Let us not fall for the typical mind fallacy, expecting less risk-tolerant people to move into AI safety all by themselves. Academia can provide that supporting infrastructure that they need.


AASAA addresses both of these issues. It has 2 phases:

A: Distill the field of AI safety into a high-quality MOOC: “Introduction to AI safety”

B: Use the MOOC as a proof of concept to convince universities to teach the field

 

read more...

 

We are bottlenecked for volunteers and ideas. If you'd like to help out, even if just by sharing perspective, fill in this form and I will invite you to the slack and get you involved.

Humans are not agents: short vs long term

4 Stuart_Armstrong 09 June 2017 11:16AM

Crossposted at the Intelligent Agents Forum.

This is an example of humans not being (idealised) agents.

Imagine a human who has a preference to not live beyond a hundred years. However, they want to live to next year, and it's predictable that every year they are alive, they will have the same desire to survive till the next year.

This human (not a completely implausible example, I hope!) has a contradiction between their long and short term preferences. So which is accurate? It seems we could resolve these preferences in favour of the short term ("live forever") or the long term ("die after a century") preferences.

Now, at this point, maybe we could appeal to meta-preferences - what would the human themselves want, if they could choose? But often these meta-preferences are un- or under-formed, and can be influenced by how the question or debate is framed.

Specifically, suppose we are scheduling this human's agenda. We have the choice of making them meet one of two philosophers (not meeting anyone is not an option). If they meet Professor R. T. Long, he will advise them to follow long term preferences. If instead, they meet Paul Kurtz, he will advise them to pay attention to their short term preferences. Whichever one they meet, they will argue for a while and will then settle on the recommended preference resolution. And then they will not change that, whoever they meet subsequently.

Since we are doing the scheduling, we effectively control the human's meta-preferences on this issue. What should we do? And what principles should we use to do so?

It's clear that this can apply to AIs: if they are simultaneously aiding humans as well as learning their preferences, they will have multiple opportunities to do this sort of preference-shaping.

Regulatory lags for New Technology [2013 notes]

5 gwern 31 May 2017 01:27AM

I found some old notes from June 2013 on time delays in how fast one can expect Western political systems & legislators to respond to new technical developments.

In general, response is slow and on the order of political cycles; one implication I take away is that an AI takeoff could happen over half a decade or more without any meaningful political control and would effectively be a ‘fast takeoff’, especially if it avoids any obvious mistakes.

1 Regulatory lag

“Regulatory delay” is the delay between the specific action required by regulators or legislatures to permit some new technology or method and the feasibility of the technology or method; “regulatory lag” is the converse, then, and is the gap between feasibility and reactive regulation of new technology. Computer software (and artificial intelligence in particular) is mostly unregulated, so it is subject to lag rather than delay.

Unfortunately almost all research seems to focus on modeling lags in the context of heavily regulated industries (especially natural monopolies like insurance or utilities), and few studies focus on compiling data on how long a lag can be expected between a new innovation or technology and its regulation. As one would expect, the few results point to lags on the order of years; for example, Ippolito 1979 (“The Effects of Price Regulation in the Automobile Insurance Industry”) finds that the period of price changes goes from 11 months in unregulated US states to 21 months in regulated states, suggesting the price-change framework itself causes a lag of almost a year.

Below, I cover some specific examples, attempting to estimate the lags myself:

(Nuclear weapons would be an interesting example but it’s hard to say what ‘lag’ would be inasmuch as they were born in government control and are subject to no meaningful global control; however, if the early proposals for a world government or unified nuclear weapon organization had gone through, they would also have represented a lag of at least 5 years.)

continue reading »

Divergent preferences and meta-preferences

4 Stuart_Armstrong 30 May 2017 07:33AM

Crossposted at the Intelligent Agents Forum.

In simple graphical form, here is the problem of divergent human preferences:

Here the AI either chooses A or ¬A, and as a consequence, the human then chooses B or ¬B.

There are a variety of situations in which this is or isn't a problem (when A or B or their negations aren't defined, take them to be the negation of what is defined):

  • Not problems:
    • A/¬A = "gives right shoe/left shoe", B/¬B = "adds left shoe/right shoe".
    • A =  "offers drink", ¬B = "goes looking for extra drink".
    • A = "gives money", B = "makes large purchase".
  • Potentially problems:
    • A/¬A = "causes human to fall in love with X/Y", B/¬B = "moves to X's/Y's country".
    • A/¬A = "recommends studying X/Y", B/¬B = "choose profession P/Q".
    • A = "lets human conceive child", ¬B = "keeps up previous hobbies and friendships".
  • Problems:
    • A = "coercive brain surgery", B = anything.
    • A = "extreme manipulation", B = almost anything.
    • A = "heroin injection", B = "wants more heroin".

So, what are the differences? For the "not problems", it makes sense to model the human as having a single reward R, variously "likes having a matching pair of shoes", "needs a certain amount of fluids", and "values certain purchases". Then all the AI is doing is helping (or not) the human towards that goal.

As you move more towards the "problems", notice that they seem to have two distinct human reward functions, R_A and R_¬A, and that the AI's actions seem to choose which one the human will end up with. In the spirit of humans not being agents, this seems to be the AI determining what values the human will come to possess.

 

Grue, Bleen, and agency

Of course, you could always say that the human actually has reward R = I_A R_A + (1 - I_A) R_¬A, where I_A is the indicator function as to whether the AI does action A or not.

Similarly to the grue and bleen problem, there is no logical way of distinguishing that "pieced-together" R from a more "natural" R (such as valuing pleasure, for instance). Thus there is no logical way of distinguishing the human being an agent from the human not being an agent, just from its preferences and behaviour.

However, from a learning and computational complexity point of view, it does make sense to distinguish "natural" R's (where R_A and R_¬A are essentially the same, despite the human's actions being different) from composite R's.

This allows us to define:

  • Preference divergence point: A preference divergence point is one where R_A and R_¬A are sufficiently distinct, according to some criteria of distinction.

Note that sometimes, R_A = R_A' + R' and R_¬A = R_¬A' + R': the two rewards R_A and R_¬A overlap on a common piece R', but diverge on R_A' and R_¬A'. It makes sense to define this as a preference divergence point as well, if R_A' and R_¬A' are "important" in the agent's subsequent decisions. Importance is a somewhat hazy metric, which would, for instance, assess how much R' reward the human would sacrifice to increase R_A' and R_¬A'.
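
As a toy illustration (the function names and the specific distinctness criterion below are mine, not from the post), a composite reward and a crude divergence check might look like this:

    # Toy sketch: a composite reward R = I_A * R_A + (1 - I_A) * R_¬A, plus one
    # possible (purely illustrative) criterion for a preference divergence point.

    def composite_reward(ai_does_A, R_A, R_notA, outcome):
        # I_A is 1 if the AI does action A, 0 otherwise.
        return R_A(outcome) if ai_does_A else R_notA(outcome)

    def is_divergence_point(R_A, R_notA, outcomes, threshold=1.0):
        # "Sufficiently distinct": here, maximum disagreement over the outcomes.
        gap = max(abs(R_A(o) - R_notA(o)) for o in outcomes)
        return gap > threshold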

 

Meta-preferences

From the perspective of revealed preferences about the human, R(μ) = I_A R_A + μ(1 - I_A) R_¬A will predict the same behaviour for all scaling factors μ > 0.

Thus at a preference divergence point, the AI's behaviour, if it were an R(μ) maximiser, would depend on the non-observed weighting between the two divergent preferences.

This is unsafe, especially if one of the divergent preferences is much easier to achieve a high value with than the other.

Thus preference divergence points are moments when the AI should turn explicitly to human meta-preferences to distinguish between them.

This can be made recursive. If we see the human meta-preferences as explicitly weighting R_A versus R_¬A and hence giving R, then if there is a prior AI decision point Z at which, depending on what the AI chooses, the human meta-preferences will be different, this gives two reward functions R_Z = I_A R_A + μ_Z(1 - I_A) R_¬A and R_¬Z = I_A R_A + μ_¬Z(1 - I_A) R_¬A, with different weights μ_Z and μ_¬Z.

If these weights are sufficiently distinct, this could identify a meta-preference divergence point and hence a point where human meta-meta-preferences become relevant.

Looking for machine learning and computer science collaborators

9 Stuart_Armstrong 26 May 2017 11:53AM

I've been recently struggling to translate my various AI safety ideas (low impact, truth for AI, Oracles, counterfactuals for value learning, etc...) into formalised versions that can be presented to the machine learning/computer science world in terms they can understand and critique.

What would be useful for me is a collaborator who knows the machine learning world (and preferably has presented papers at conferences) with whom I could co-write papers. They don't need to know much of anything about AI safety - explaining the concepts to people unfamiliar with them is going to be part of the challenge.

The results of this collaboration should be papers like Safely Interruptible Agents, with Laurent Orseau of Deep Mind, and Interactive Inverse Reinforcement Learning, with Jan Leike of the FHI/Deep Mind.

It would be especially useful if the collaborators were located physically close to Oxford (UK).

Let me know in the comments if you know of, or are, a potential candidate.

Cheers!

AI safety: three human problems and one AI issue

9 Stuart_Armstrong 19 May 2017 10:48AM

Crossposted at the Intelligent agent foundation.

There have been various attempts to classify the problems in AI safety research, from our old Oracle paper, which classified then-theoretical methods of control, to more recent classifications that grow out of modern, more concrete problems.

These all serve their purpose, but I think a more enlightening classification of the AI safety problems is to look at what issues we are trying to solve or avoid. And most of these issues are problems about humans.

Specifically, I feel AI safety issues can be classified as three human problems and one central AI issue. The human problems are:

  • Humans don't know their own values (sub-issue: humans know their values better in retrospect than in prediction).
  • Humans are not agents and don't have stable values (sub-issue: humanity itself is even less of an agent).
  • Humans have poor predictions of an AI's behaviour.

And the central AI issue is:

  • AIs could become extremely powerful.

Obviously if humans were agents and knew their own values and could predict whether a given AI would follow those values or not, there would be no problem. Conversely, if AIs were weak, then the human failings wouldn't matter so much.

The point about human values is relatively straightforward, but what's the problem with humans not being agents? Essentially, humans can be threatened, tricked, seduced, exhausted, drugged, modified, and so on, into acting seemingly against our interests and values.

If humans were clearly defined agents, then what counts as a trick or a modification would be easy to define and exclude. But since this is not the case, we're reduced to trying to figure out the extent to which something like a heroin injection is a valid way to influence human preferences. This both makes humans susceptible to manipulation and makes human values hard to define.

Finally, the issue of humans having poor predictions of AI is more general than it seems. If you want to ensure that an AI has the same behaviour in the testing and training environment, then you're essentially trying to guarantee that you can predict that the testing environment behaviour will be the same as the (presumably safe) training environment behaviour.

 

How to classify methods and problems

That's well and good, but how do various traditional AI methods or problems fit into this framework? This should give us an idea as to whether the framework is useful.

It seems to me that:

 

  • Friendly AI is trying to solve the values problem directly.
  • IRL and Cooperative IRL are also trying to solve the values problem. The greatest weakness of these methods is the not agents problem.
  • Corrigibility/interruptibility are also addressing the issue of humans not knowing their own values, using the sub-issue that human values are clearer in retrospect. These methods also overlap with poor predictions.
  • AI transparency is aimed at getting round the poor predictions problem.
  • Laurent's work on carefully defining the properties of agents is mainly also about solving the poor predictions problem.
  • Low impact and Oracles are aimed squarely at preventing AIs from becoming powerful. Methods that restrict the Oracle's output implicitly accept that humans are not agents.
  • Robustness of the AI to changes between testing and training environment, degradation and corruption, etc... ensures that humans won't be making poor predictions about the AI.
  • Robustness to adversaries is dealing with the sub-issue that humanity is not an agent.
  • The modular approach of Eric Drexler is aimed at preventing AIs from becoming too powerful, while reducing our poor predictions.
  • Logical uncertainty, if solved, would reduce the scope for certain types of poor predictions about AIs.
  • Wireheading, when the AI takes control of the reward channel, is a problem that arises because humans don't know their values (and hence use an indirect reward) and because humans make poor predictions about the AI's actions.
  • Wireheading, when the AI takes control of the human, is as above but also a problem that humans are not agents.
  • Incomplete specifications are either a problem of not knowing our own values (and hence missing something important in the reward/utility) or of making poor predictions (when we thought that a situation was covered by our specification, but it turned out not to be).
  • AIs modelling human knowledge seem to be mostly about getting round the fact that humans are not agents.

Putting this all in a table:

 

Method                          | Values | Not Agents | Poor Predictions | Powerful
Friendly AI                     | X      |            |                  |
IRL and CIRL                    | X      |            |                  |
Corrigibility/interruptibility  | X      |            | X                |
AI transparency                 |        |            | X                |
Laurent's work                  |        |            | X                |
Low impact and Oracles          |        | X          |                  | X
Robustness                      |        |            | X                |
Robustness to adversaries       |        | X          |                  |
Modular approach                |        |            | X                | X
Logical uncertainty             |        |            | X                |
Wireheading (reward channel)    | X      |            | X                |
Wireheading (human)             | X      | X          | X                |
Incomplete specifications       | X      |            | X                |
AIs modelling human knowledge   |        | X          |                  |
 

Further refinements of the framework

It seems to me that the third category - poor predictions - is the most likely to be expandable. For the moment, it just incorporates all our lack of understanding about how AIs would behave, but it might be more useful to subdivide this.

[Link] Keeping up with deep reinforcement learning research: /r/reinforcementlearning

3 gwern 16 May 2017 07:12PM

AI arms race

5 Stuart_Armstrong 04 May 2017 10:59AM

Racing to the Precipice: a Model of Artificial Intelligence Development

by Stuart Armstrong, Nick Bostrom, and Carl Shulman

This paper presents a simple model of an AI arms race, where several development teams race to build the first AI. Under the assumption that the first AI will be very powerful and transformative, each team is incentivised to finish first – by skimping on safety precautions if need be. This paper presents the Nash equilibrium of this process, where each team takes the correct amount of safety precautions in the arms race. Having extra development teams and extra enmity between teams can increase the danger of an AI-disaster, especially if risk taking is more important than skill in developing the AI. Surprisingly, information also increases the risks: the more teams know about each others’ capabilities (and about their own), the more the danger increases.

 

[Link] Moral Robots: Making sense of robot ethics. News aggregator

0 morganism 29 April 2017 09:51PM

The AI Alignment Problem Has Already Been Solved(?) Once

27 SquirrelInHell 22 April 2017 01:24PM

ALBA: can you be "aligned" at increased "capacity"?

3 Stuart_Armstrong 13 April 2017 07:23PM

Crossposted at the Intelligent Agents Forum.

I think that Paul Christiano's ALBA proposal is good in practice, but has conceptual problems in principle.

Specifically, I don't think it makes sense to talk about bootstrapping an "aligned" agent to one that is still "aligned" but that has an increased capacity.

The main reason is that I don't see "aligned" as a definition that makes sense distinct from capacity.

 

These are not the lands of your forefathers

Here's a simple example: let r be a reward function that is perfectly aligned with human happiness within ordinary circumstances (and within a few un-ordinary circumstances that humans can think up).

Then the initial agent - B0, a human - trains a reward r1 for an agent A1. This agent is limited in some way - maybe it doesn't have much speed or time - but the aim is for r1 to ensure that A1 is aligned with B0.

Then the capacity of A1 is increased to B1, a slow powerful agent. It computes the reward r2 to ensure the alignment of A2, and so on.

The nature of the Bj agents is not defined - they might be algorithms calling Ai for i ≤ j as subroutines, humans may be involved, and so on.
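
Schematically (this is my own illustrative paraphrase of the bootstrapping sequence above, not Christiano's actual ALBA construction; distill, train_agent and amplify are placeholder procedures):

    # Schematic sketch of the bootstrapping sequence described above.

    def bootstrap(B0, distill, train_agent, amplify, steps):
        B = B0                    # the initial agent B0: a human
        agents = []
        for _ in range(steps):
            r = distill(B)        # B_i trains a reward r_{i+1}
            A = train_agent(r)    # limited agent A_{i+1}, hopefully aligned with B_i
            B = amplify(A)        # capacity increase: slow, powerful agent B_{i+1}
            agents.append(A)
        return agents, B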

If the humans are unimaginative and don't deliberately seek out more extreme and exotic test cases, the best case scenario is for ri → r as i → ∞.

And eventually there will be an agent An that is powerful enough to overwhelm the whole system and take over. It will do this in full agreement with Bn-1, because they share the same objective. And then An will push the world into extra-ordinary circumstance and proceed to maximise r, with likely disastrous results for us humans.

 

The nature of the problem

So what went wrong? At what point did the agents go out of alignment?

In one sense, at An. In another sense, at A1 (and, in another interesting sense, at B0, the human). The reward r was aligned, as long as the agent stayed near the bounds of the ordinary. As soon as it was no longer restricted to that, it went out of alignment, not because of a goal drift, but because of a capacity increase.

[Link] "Future of Go" summit with AlphaGo

3 gjm 10 April 2017 11:10AM

How AI/AGI/Consciousness works - my layman theory

0 rayalez 09 March 2017 09:17AM

This is just my layman theory. Maybe it’s obvious to experts, and it probably has flaws. But it seems to make sense to me, and perhaps it will give you some ideas. I would love to hear your thoughts/feedback!

 


Consume input

The data you need from the world (like video), and useful metrics we want to optimize for, like the number of paperclips in the world.

 

Make predictions and take action

Like deep learning does.

How do human brains convert their structure into action?

Maybe like:

- Take the current picture of the world as an input.

- Come up with random action.

- “Imagine” what will happen.

Take the current world + action, and run it through the ANN. Predict the outcome of the action applied to the world.

- Does the output increase the metrics we want? If yes — send out the signals to take action. If no — come up with another random action and repeat.

 

Update beliefs

Look at the outcome of the action. Does the picture of the world correspond to the picture we’ve imagined? Did this action increase the good metrics? Did the number of paperclips in the world increase? If it did — positive reinforcement: backpropagate and reinforce the weights.

 

Repeat

Take the current picture of the world => Imagine applying an action to it => Take action => Positive/negative reinforcement to improve our model => Repeat until the metrics we want equal the goal we have set.
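
As a rough sketch (all the names below are placeholders of my own, not a real system), that loop could be written like this:

    # Rough sketch of the imagine -> act -> update loop described above.
    import random

    def choose_action(world, model, actions, metric, tries=100):
        for _ in range(tries):
            action = random.choice(actions)          # come up with a random action
            imagined = model.predict(world, action)  # "imagine" what will happen
            if metric(imagined) > metric(world):     # does it increase the metric?
                return action
        return random.choice(actions)

    def step(world, model, actions, metric, act, observe):
        action = choose_action(world, model, actions, metric)
        act(action)                                  # send out the signals to take action
        new_world = observe()                        # look at the actual outcome
        model.update(world, action, new_world)       # backpropagate / reinforce the weights
        return new_world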

 


 

Consciousness

Consciousness is neurons observing/recognizing patterns of other neurons.

When you see the word “cat”— photons from the page come to your retina and are converted to neural signal. A network of cells recognizes the shape of letters C, A, and T. And then a higher level, more abstract network recognizes that these letters together form the concept of a cat.

You can also recognize signals coming from the nerve cells within your body, like feeling a pain when stabbing a toe.

In the same way, neurons in the brain recognize the signals coming from the other neurons within the brain. So the brain “observes/feels/experiences” itself: it builds a model of itself, just like it builds a map of the world around it, and “mirrors” itself (GEB).

 

Sentient and self-improving

So the structure of the network itself is fed as one of its inputs, along with the video and metrics we want to optimize for. It can see itself as a part of the state of the world it bases predictions on. That’s what being sentient means.

And then one of the possible actions it can take is to modify its own structure. “Imagine” modifying the structure a certain way; if you predict that it leads to better predictions/outcomes — modify it. If it did lead to more paperclips — reinforce the weights to do more of that. So it keeps continually self-improving.

 

Friendly

We don’t want this to lead to an infinite number of paperclips, and we don’t know how to quantify the things we value as humans. We can’t turn the “amount of happiness” in the world into a concrete metric without unintended consequences (like all human brains being hooked up to wires that stimulate our pleasure centers).

That’s why instead of trying to encode the abstract values to maximize for, we encode very specific goals.

- Make 100 paperclips (utility function is “Did I make 100 paperclips?”)

- Build 1000 cars

- Write a paper on how to cure cancer

Humans remain in charge, determine the goals we want, and let the AI figure out how to accomplish them. It could still go wrong, but it's less likely.
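
As a tiny illustration of that difference (mine, not the author's): an open-ended metric keeps rewarding "more", while a specific bounded goal is satisfied once and then stops driving behaviour.

    # Illustrative only: open-ended metric vs. specific, bounded goal.
    open_ended_metric = lambda paperclip_count: paperclip_count         # more is always better
    bounded_goal      = lambda paperclip_count: paperclip_count >= 100  # "Did I make 100 paperclips?"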


(originally published on my main blog)

[Link] Weaponising Twitter bots and political algos.

1 morganism 05 March 2017 09:39PM

[Link] What Should the Average EA Do About AI Alignment?

4 Raemon 25 February 2017 08:37PM

Translation "counterfactual"

1 Stuart_Armstrong 24 February 2017 06:36PM

Crossposted at Intelligent Agent Forum

In a previous post, I briefly mentioned translations as one of three possible counterfactuals for indifference. Here I want to clarify what I meant there, because the idea is interesting.

continue reading »

Nearest unblocked strategy versus learning patches

6 Stuart_Armstrong 23 February 2017 12:42PM

Crossposted at Intelligent Agents Forum.

The nearest unblocked strategy problem (NUS) is the idea that if you program a restriction or a patch into an AI, then the AI will often be motivated to pick a strategy that is as close as possible to the banned strategy, very similar in form, and maybe just as dangerous.

For instance, if the AI is maximising a reward R, and does some behaviour B_i that we don't like, we can patch the AI's algorithm with patch P_i ('maximise R_0 subject to these constraints...'), or modify R to R_i so that B_i doesn't come up. I'll focus more on the patching example, but the modified reward one is similar.
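
A toy illustration of the problem (my own, not from the post): if the patch simply bans the single behaviour we dislike, a reward-maximiser just moves to the nearest strategy that isn't banned.

    # Toy example: strategies ordered by how "extreme" they are; reward grows
    # with extremeness, and the patch bans only the most extreme strategy.

    def best_allowed_strategy(strategies, reward, banned):
        allowed = [s for s in strategies if s not in banned]
        return max(allowed, key=reward)

    strategies = list(range(11))       # 0 = tame ... 10 = the behaviour we dislike
    reward = lambda s: s               # more extreme = more reward
    print(best_allowed_strategy(strategies, reward, banned={10}))  # -> 9, nearly as bad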

continue reading »

[Link] DARPA Perspective on AI

1 morganism 23 February 2017 03:27AM

Indifference and compensatory rewards

3 Stuart_Armstrong 15 February 2017 02:49PM

Crossposted at the Intelligent Agents Forum

It's occurred to me that there is a framework where we can see all "indifference" results as corrective rewards, both for the utility function change indifference and for the policy change indifference.

Imagine that the agent has reward R_0 and is following policy π_0, and we want to change it to having reward R_1 and following policy π_1.

Then the corrective reward we need to pay it, so that it doesn't attempt to resist or cause that change, is simply the difference between the two expected values:

V(R_0|π_0) - V(R_1|π_1),

where V is the agent's own valuation of the expected reward, conditional on the policy.

This explains why off-policy reward-based agents are already safely interruptible: since we change the policy, not the reward, R_0 = R_1. And since off-policy agents have value estimates that are indifferent to the policy followed, V(R_0|π_0) = V(R_1|π_1), and the compensatory rewards are zero.
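
In code form (a minimal sketch with illustrative placeholders of my own, not a general implementation), the compensatory reward is just the difference between the agent's own value estimates:

    # The corrective reward for moving the agent from (R0, pi0) to (R1, pi1)
    # is V(R0|pi0) - V(R1|pi1), computed with the agent's own expectations.
    # `expected_trajectories(policy)` is a placeholder for whatever the agent
    # predicts will happen under that policy.

    def value(reward, policy, expected_trajectories):
        trajs = expected_trajectories(policy)
        return sum(sum(map(reward, t)) for t in trajs) / len(trajs)

    def compensatory_reward(R0, pi0, R1, pi1, expected_trajectories):
        return (value(R0, pi0, expected_trajectories)
                - value(R1, pi1, expected_trajectories))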

Allegory On AI Risk, Game Theory, and Mithril

25 James_Miller 13 February 2017 08:41PM

“Thorin, I can’t accept your generous job offer because, honestly, I think that your company might destroy Middle Earth.”  

 

“Bifur, I can tell that you’re one of those “the Balrog is real, evil, and near” folks who thinks that in the next few decades Mithril miners will dig deep enough to wake the Balrog causing him to rise and destroy Middle Earth.  Let’s say for the sake of argument that you’re right.  You must know that lots of people disagree with you.  Some don’t believe in the Balrog, others think that anything that powerful will inevitably be good, and more think we are hundreds or even thousands of years away from being able to disturb any possible Balrog.  These other dwarves are not going to stop mining, especially given the value of Mithril.  If you’re right about the Balrog we are doomed regardless of what you do, so why not have a high paying career as a Mithril miner and enjoy yourself while you can?”  

 

“But Thorin, if everyone thought that way we would be doomed!”

 

“Exactly, so make the most of what little remains of your life.”

 

“Thorin, what if I could somehow convince everyone that I’m right about the Balrog?”

 

“You can’t because, as the wise Sinclair said, ‘It is difficult to get a dwarf to understand something, when his salary depends upon his not understanding it!’  But even if you could, it still wouldn’t matter.  Each individual miner would correctly realize that his mining Mithril alone is extraordinarily unlikely to be the cause of the Balrog awakening, and so he would find it in his self-interest to mine.  And knowing that others are going to continue to extract Mithril means that it really doesn’t matter whether you mine: if we are close to disturbing the Balrog, he will be awoken.” 

 

“But dwarves can’t be that selfish, can they?”  

 

“Actually, altruism could doom us as well.  Given Mithril’s enormous military value many cities rightly fear that without new supplies they will be at the mercy of cities that get more of this metal, especially as it’s known that the deeper Mithril is found, the greater its powers.  Leaders who care about their citizen’s safety and freedom will keep mining Mithril.  If we are soon all going to die, altruistic leaders will want to make sure their people die while still free citizens of Middle Earth.”

 

“But couldn’t we all coordinate to stop mining?  This would be in our collective interest.”

 

“No, dwarves would cheat, rightly realizing that if they alone mine a little bit more Mithril, it’s highly unlikely to do anything to the Balrog. And the more you expect others to cheat, the less your own cheating matters as to whether the Balrog gets us, if your assumptions about the Balrog are correct.”  

 

“OK, but won’t the rich dwarves step in and eventually stop the mining?  They surely don’t want to get eaten by the Balrog.”   

 

“Actually, they have just started an open Mithril mining initiative which will find and then freely disseminate new and improved Mithril mining technology.  These dwarves earned their wealth through Mithril, they love Mithril, and while some of them can theoretically understand how Mithril mining might be bad, they can’t emotionally accept that their life’s work, the acts that have given them enormous success and status, might significantly hasten our annihilation.”

 

“Won’t the dwarven kings save us?  After all, their primary job is to protect their realms from monsters.”

 

“Ha!  They are more likely to subsidize Mithril mining than to stop it.  Their military machines need Mithril, and any king who prevented his people from getting new Mithril just to stop some hypothetical Balrog from rising would be laughed out of office.  The common dwarf simply doesn’t have the expertise to evaluate the legitimacy of the Balrog claims and so rightly, from their viewpoint at least, would use the absurdity heuristic to dismiss any Balrog worries.  Plus, remember that the kings compete with each other for the loyalty of dwarves and even if a few kings came to believe in the dangers posed by the Balrog they would realize that if they tried to impose costs on their people, they would be outcompeted by fellow kings that didn’t try to restrict Mithril mining.  Bifur, the best you can hope for with the kings is that they don’t do too much to accelerate Mithril mining.”

 

“Well, at least if I don’t do any mining it will take a bit longer for miners to awake the Balrog.”

 

“No Bifur, you obviously have never considered the economics of mining.  You see, if you don’t take this job someone else will.  Companies such as ours hire the optimal number of Mithril miners to maximize our profits and this number won’t change if you turn down our offer.”

 

“But it takes a long time to train a miner.  If I refuse to work for you, you might have to wait a bit before hiring someone else.”

 

“Bifur, what job will you likely take if you don’t mine Mithril?”

 

“Gold mining.”

 

“Mining gold and Mithril require similar skills.  If you get a job working for a gold mining company, this firm would hire one less dwarf than it otherwise would and this dwarf’s time will be freed up to mine Mithril.  If you consider the marginal impact of your actions, you will see that working for us really doesn’t hasten the end of the world even under your Balrog assumptions.”  

 

“OK, but I still don’t want to play any part in the destruction of the world, so I refuse to work for you even if this won’t do anything to delay when the Balrog destroys us.”

 

“Bifur, focus on the marginal consequences of your actions and don’t let your moral purity concerns cause you to make the situation worse.  We’ve established that your turning down the job will do nothing to delay the Balrog.  It will, however, cause you to earn a lower income.  You could have donated that income to the needy, or even used it to hire a wizard to work on an admittedly long-shot, Balrog control spell.  Mining Mithril is both in your self-interest and is what’s best for Middle Earth.” 


[Link] Changes in AI Safety Funding

3 siIver 11 February 2017 08:36AM

[Link] Slate Star Codex Notes on the Asilomar Conference on Beneficial AI

13 Gunnar_Zarncke 07 February 2017 12:14PM

Request for collaborators - Survey on AI risk priorities

2 whpearson 06 February 2017 08:14PM

After some conversations here, I thought I would try to find out what the community of people who care about AI risk thinks the priorities for research are.

To represent people's opinions fairly, I wanted to get input from people who care about the future of intelligence. Also, I figure that other people will have more experience designing and analyzing surveys than I do, and getting their help or advice would be a good plan.

Planning document

Here is the planning document; give me a shout if you want edit rights. I'll be filling in the areas for research over the next week or so.

I'll set up a trello if I get a few people interested.

True understanding comes from passing exams

6 Stuart_Armstrong 06 February 2017 11:51AM

Crossposted at the Intelligent Agent Forum

I'll try to clarify what I was doing with the AI truth setup in a previous post. First I'll explain the nature of the challenge, and then how the setup tries to solve it.

The nature of the challenge is to have an AI give genuine understanding to a human. Getting the truth out of an AI or Oracle is not that hard, conceptually: you get the AI to report some formal property of its model. The problem is that that truth can be completely misleading, or, more likely, incomprehensible.

continue reading »
