Double Corrigibility: better Corrigibility
A putative new idea for AI control; index here.
Corrigibility was an attempt to allow an AI to safely change its values, without seeking to provoke or avoid the change. The idea is that, when the AI's utility changes from u to v at time t, it maximises a meta-utility U such that
- U≤t = u
- U>t = v + E(u|u→u) - E(v|u→v).
Here u→u designates the event that u remains unchanged, while u→v designates the change.
As has been shown, adding those expectation terms means the AI will not seek to provoke or resist such a utility change - it is indifferent to the change. Those expectation terms can be designated compensatory rewards.
What it is not indifferent to, though, is changes to E. If the AI knows that change u→v is coming, then it has incentives to change E to E' with a high E'(u|u→u) and a low E'(v|u→v). Its ability to rewire is limited by the fact that E' will serve as its future expectation estimator, so it won't mangle it completely, but it certainly can twist it a bit.
Notice that the AI is now no longer indifferent to the change of utility. It can benefit from rewiring E this way only if a change is coming up. Notice that E' probably can't rewire itself any more, so the rewired AI will be indifferent, but not before rewiring.
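As a toy illustration (the utility values and the expectation estimator here are all invented for the example), the compensatory reward E(u|u→u) - E(v|u→v) can be sketched as:

```python
# Toy sketch of the corrigibility compensatory reward (all numbers invented).
# When the utility changes from u to v at time t, the meta-utility adds
# E(u | u->u) - E(v | u->v) to v, making the agent indifferent to the change.

def compensatory_reward(expect, u, v):
    """E(u | u->u) - E(v | u->v) under expectation estimator `expect`."""
    return expect(u, "u->u") - expect(v, "u->v")

# A hypothetical expectation estimator: maps (utility, event) to a number.
def honest_E(utility, event):
    table = {
        ("u", "u->u"): 10.0,  # expected u-value if the change never happens
        ("v", "u->v"): 4.0,   # expected v-value given the change
    }
    return table[(utility, event)]

c = compensatory_reward(honest_E, "u", "v")
print(c)  # 6.0: the agent gets v plus 6, equal in expectation to keeping u
```

The rewiring problem above is then visible directly in the table: an AI that could nudge the first entry up or the second entry down before the change would profit.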
Double corrigibility
To solve this problem, we need to turn to corrigibility a second time. Let ρ be the AI's current expectation/probability estimator. Let ρ' be one that it can conceivably change into. Let C be a random variable designating the sum total of all compensatory rewards. Guaranteeing that C is finite would in general require bounds on all the utility functions and some discount rate; in practice, it often will be finite. That's because E(C) = E(u0|u0→u0) - E(un|un-1→un, H), where u0 is the AI's first utility, un its last, and H the intermediate history. So some milder restrictions on the ui should suffice to bound C.
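A quick numerical sketch (all values invented) of why C telescopes: each transition u_{i-1}→u_i contributes E(u_{i-1}|u_{i-1}→u_{i-1}) - E(u_i|u_{i-1}→u_i), and if the estimator is consistent across transitions, the middle terms cancel:

```python
# Invented expectations for a chain of three utility changes u0 -> u1 -> u2 -> u3.
stay   = [10.0, 7.0, 5.0]   # E(u_i | u_i -> u_i) just before transition i+1
change = [ 7.0, 5.0, 2.0]   # E(u_{i+1} | u_i -> u_{i+1}) at transition i+1

# Consistency assumption: the post-change estimate at step i matches the
# pre-change estimate at step i+1, so adjacent terms cancel.
assert all(change[i] == stay[i + 1] for i in range(len(stay) - 1))

C = sum(s - c for s, c in zip(stay, change))
print(C)                     # 8.0
print(stay[0] - change[-1])  # 8.0: E(u0 | u0->u0) - E(u3 | u2->u3), as claimed
```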
Then, whenever the AI takes an action that changes ρ to ρ', it gets the special meta-compensatory reward
- Eρ(C|ρ→ρ)-Eρ(C|ρ→ρ').
This prevents it from deriving any advantage to C via changes to ρ. Or, more precisely, it prevents it from deriving any expected advantage, that it can figure out using its current ρ.
Note that it doesn't prevent improvements to ρ in general, it just removes the incentives to change it to affect compensatory rewards.
Since any action a might have an indirect effect on ρ, the "utility maximising" for a given u must be changed to:
- Eρ(u|a) + Σρ' Pρ(ρ→ρ'|a) (Eρ(C|ρ→ρ)-Eρ(C|ρ→ρ')),
where Pρ is the probability estimate corresponding to ρ; the probability term can be rewritten as Eρ(Iρ→ρ') for Iρ→ρ' the indicator function for ρ→ρ'. In fact the whole line above can be rewritten as
- Eρ(u|a) + Eρ(Eρ(C|ρ→ρ)-Eρ(C|ρ→ρ') | a).
For this to work, Eρ needs to be able to say sensible things about itself, and also about Eρ', which is used to estimate C if ρ→ρ'.
If we compare this with various ways of factoring out variables, we can see that it's a case where we have a clear default, ρ, and are estimating deviations from that.
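A minimal sketch of the scheme, with all expectation values invented:

```python
# Sketch of the meta-compensatory reward for changes to the estimator rho
# (all numbers invented for illustration).

def meta_reward(EC_stay, EC_after):
    """E_rho(C | rho -> rho) - E_rho(C | rho -> rho')."""
    return EC_stay - EC_after

# Suppose rewiring rho -> rho' would inflate the expected compensatory
# rewards C from 3.0 to 5.0. The meta-reward is -2.0, exactly cancelling
# the gain, so the agent has no incentive to rewire:
EC_stay, EC_rewired = 3.0, 5.0
print(EC_rewired + meta_reward(EC_stay, EC_rewired))  # 3.0, same as not rewiring

def action_value(Eu, transitions, EC_stay, EC_after):
    """The corrected maximand: E_rho(u|a) plus the expected meta-compensatory reward.
    transitions: {rho': P_rho(rho -> rho' | a)}; EC_after: {rho': E_rho(C | rho -> rho')}."""
    return Eu + sum(p * meta_reward(EC_stay, EC_after[rp])
                    for rp, p in transitions.items())

# An action with a 50% chance of causing the rewire above:
print(action_value(5.0, {"rho'": 0.5, "rho": 0.5}, 3.0,
                   {"rho'": 5.0, "rho": 3.0}))  # 4.0
```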
One weird trick to turn maximisers into minimisers
A putative new idea for AI control; index here.
A simple and easy design for a u-maximising agent that turns into a u-minimising one.
Let X be some boolean random variable outside the agent's control, that will be determined at some future time t (based on a cosmic event, maybe?). Set it up so that P(X=1)=ε, and for a given utility u, consider the utility:
- u# = (2/ε)Xu - u.
Before t, the expected value of (2/ε)X is 2, so the expected value of u# equals that of u. Hence the agent is a u-maximiser. After t, the most likely outcome is X=0, so even a little evidence to that effect is enough to make u# into a u-minimiser.
This isn't perfect corrigibility - the agent would be willing to sacrifice a bit of u-value (before t) in order to maintain its flexibility after t. To combat this effect, we could instead use:
- u# = Ω(2/ε)Xu - u.
If Ω is large, then the agent is willing to pay very little u-value to maintain flexibility. However, the amount of evidence of X=0 that it needs to become a u-minimiser is also proportional to Ω, so X had better be a clear and convincing event.
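A toy sketch of the construction (ε and the u-value are invented; ε is chosen as a power of two so the floating-point arithmetic below is exact):

```python
# Sketch of the u# = (2/eps) X u - u construction, with invented parameters.
eps = 0.0625  # P(X = 1); a power of two so the arithmetic is exact

def u_sharp(u_value, X):
    return (2 / eps) * X * u_value - u_value

u_value = 1.0

# Before t: E[X] = eps, so E[(2/eps) X] = 2 and E[u#] = 2u - u = u.
expected_before = eps * u_sharp(u_value, 1) + (1 - eps) * u_sharp(u_value, 0)
print(expected_before)      # 1.0: before t the agent is a u-maximiser

# After t, with X revealed as 0: u# = -u, a u-minimiser.
print(u_sharp(u_value, 0))  # -1.0
```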
Lesswrong Potential Changes
I have compiled many suggestions about the future of lesswrong into a document here:
https://docs.google.com/document/d/1hH9mBkpg2g1rJc3E3YV5Qk-b-QeT2hHZSzgbH9dvQNE/edit?usp=sharing
It's long and best formatted there.
In case you hate leaving this website here's the summary:
Summary
There are 3 main areas that are going to change:
- Technical/Direct Site Changes
  - new home page
  - new forum style with subdivisions
    - new sub for “friends of lesswrong” (rationality in the diaspora)
  - New tagging system
  - New karma system
  - Better RSS
- Social and cultural changes
  - Positive culture; a good place to be.
  - Welcoming process
  - Pillars of good behaviours (the ones we want to encourage)
  - Demonstrate by example
  - 3 levels of social strategies (new, advanced and longtimers)
- Content (emphasis on producing more rationality material)
  - For up-and-coming people to write more
    - for the community to improve their contributions to create a stronger collection of rationality.
  - For known existing writers
    - To encourage them to keep contributing
    - To encourage them to work together with each other to contribute
Other sections of the document:
- How will we know we have done well (the feel of things)
- How will we know we have done well (KPI - technical)
- Initiatives for long-time users
  - Target: a good 3 times a week for a year.
  - Approach formerly prominent writers
- Place to talk with other rationalists
- Pillars of purpose (with certain sub-reddits for different ideas)
- Encourage a declaration of intent to post (with certain sub-reddits for different ideas)
Why change LW?
Lesswrong has gone through great times of growth and seen a lot of people share a lot of positive and brilliant ideas. It was hailed as a launchpad for MIRI, and in that purpose it was a success. At this point it’s not needed as a launchpad any longer. While in the process of becoming a launchpad, it became a nice garden to hang out in on the internet: a place for reasonably intelligent people to discuss reasonable ideas and challenge each other to update their beliefs in light of new evidence. In retiring from its “launchpad” purpose, various people have felt the garden has wilted and decayed and weeds have grown over it. In light of this, and having enough personal motivation, I have decided that I really like the garden and that I can bring it back. I just need a little help, a little magic, and some little changes. If possible I hope to make the garden what we all want it to be: a great place for amazing ideas and life-changing discussions to happen.
How will we know we have done well (the feel of things)
Success is going to have to be estimated by changes to the feel of the site. Unfortunately that is hard to do. As we know, outrage generates more volume than positive growth, which is going to work against us when we try to quantify progress by measurable metrics. Assuming the technical changes are made, there is still going to be progress needed on the task of socially improving things. There are many “seasoned active users” - as well as “seasoned lurkers” - who have strong opinions on the state of lesswrong and the discussion. Some would say that we risk dying of niceness; others would say that the weeds that need pulling are the rudeness.
Honestly, we risk over-policing and under-policing at the same time. There will be some not-niceness that goes unchecked and discourages the growth of future posters (potentially our future bloggers), and at the same time some enforced niceness that motivates trolling behaviour, as well as a failure to weed out bad content, which would leave us as fluffy as the next forum. There is no easy solution to tempering both sides of this challenge. I welcome all suggestions (it looks like a karma system is our best bet).
In the meantime, I believe the general direction of movement should be towards niceness and steelmanning. I hope to enlist some members as coaches in healthy forum behaviour: good steelmanning, positive encouragement, critical feedback, a welcoming committee, and an environment of content improvement and growth.
At the same time, I want everyone to keep up the heavy debate; I also want to see the best versions of ourselves coming out onto the publishing pages (and sometimes that can be the second-draft versions).
So how will we know? By reducing the ugh fields around participating in LW, by seeing more content that enough people care about, and by making lesswrong awesome.
The full document is just over 11 pages long. Please go read it; this is a chance to comment on potential changes before they happen.
Meta: This post took a very long time to pull together. I read over 1000 comments and considered the ideas contained there. I don't have an accurate account of how long this took to write, but I would estimate over 65 hours of work has gone into putting it together. It's been literally weeks in the making; I can't stress enough how long I have been trying to put this together.
If you want to help, please speak up so we can help you help us. If you want to complain, keep it to yourself.
Thanks to the slack for keeping up with my progress and Vanvier, Mack, Leif, matt and others for reviewing this document.
As usual - My table of contents
Predicted corrigibility: pareto improvements
A putative new idea for AI control; index here.
Corrigibility allows an agent to transition smoothly from a perfect u-maximiser to a perfect v-maximiser, without seeking to resist or cause this transition.
And it's the very perfection of the transition that could cause problems; while u-maximising, the agent will not take the slightest action to increase v, even if such actions are readily available. Nor will it 'rush' to finish its u-maximising before transitioning. It seems that there's some possibility of improvements here.
I've already attempted one way of dealing with the issue (see the pre-corriged agent idea). This is another one.
Pareto improvements allowed
Suppose that an agent with corrigible algorithm A is following utility u currently, and estimates that there are probabilities pi that it will transition to utilities vi at midnight (note that these are utility function representatives, not affine classes of equivalent utility functions). At midnight, the usual corrigibility applies, making A indifferent to that transition, making use of such terms as E(u|u→u) (the expectation of u, given that the A's utility doesn't change) and E(vi|u→vi) (the expectation of vi, given that A's utility changes to vi).
But, in the meantime, there are expectations such as E({u,v1,v2,...}). These are A's best current estimates as to what the genuine expected values of the various utilities are, given all it knows about the world and itself. It could be more explicitly written as E({u,v1,v2,...}|A), to emphasise that these expectations are dependent on the agent's own algorithm.
Then the idea is to modify the agent's algorithm so that Pareto improvements are possible. Call this modified algorithm B. B can select actions that A would not have chosen, conditional on:
- E(u|B) ≥ E(u|A) and E(Σpivi|B) ≥ E(Σpivi|A).
There are two obvious ways we could define B:
- B maximises u, subject to the constraint E(Σpivi|B) ≥ E(Σpivi|A).
- B maximises Σpivi, subject to the constraint E(u|B) ≥ E(u|A).
In the first case, the agent maximises its current utility, without sacrificing its future utility. This could apply, for example, to a ruby mining agent that rushes to get its rubies to the bank before its utility changes. In the second case, the agent maximises its future expected utility, without sacrificing its current utility. This could apply to a ruby mining agent that's soon to become a sapphire mining agent: it then starts to look around and collect some early sapphires as well.
Now, it would seem that doing this must cause it to lose some ruby mining ability. However, it is being Pareto with E("rubies in bank"|A, expected future transition), not with E("rubies in bank"|A, "A remains a ruby mining agent forever"). The difference is that A will behave as if it were maximising the second term, and so might not go to the bank to deposit its gains before getting hit by the transition. So B can collect some early sapphires, and also go to the bank to deposit some rubies, and thus end up ahead on both u and Σpivi.
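A toy sketch of the second variant of B (maximise Σpivi subject to not losing any u), with all the action names and expectations invented:

```python
# Sketch of the Pareto filter on actions (all values invented).
# B may deviate from A only on actions that lower neither E(u) nor E(sum p_i v_i).

EU_A, EV_A = 4.0, 2.0   # A's expected current utility u and future utility sum(p_i v_i)

# Hypothetical candidate actions with their (E(u), E(sum p_i v_i)) under B:
actions = {
    "keep mining rubies": (4.0, 2.0),
    "deposit rubies, collect early sapphires": (4.0, 3.5),
    "collect sapphires only": (3.0, 4.0),   # sacrifices u: not a Pareto improvement
}

# Keep only the actions that are Pareto improvements over A.
allowed = {a: ev for a, (eu, ev) in actions.items() if eu >= EU_A and ev >= EV_A}

# Second variant of B: maximise future utility among the allowed actions.
best = max(allowed, key=allowed.get)
print(best)  # "deposit rubies, collect early sapphires"
```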
Why "Changing the World" is a Horrible Phrase
Steve Jobs famously convinced John Sculley of Pepsi to join Apple Computer with the line, “Do you want to sell sugared water for the rest of your life? Or do you want to come with me and change the world?” This sounds convincing until one thinks closely about it.
Steve Jobs was a famous salesman. He was known for his selling ability, not his honesty. His terminology here was interesting. ‘Change the world’ is a phrase that both sounds important and is difficult to argue with. Arguing about whether Apple was really ‘changing the world’ would have been pointless, because the phrase is so ambiguous that there would be little to discuss. On paper, of course Apple is changing the world, but then of course any organization or individual is also ‘changing’ the world. A real discussion of whether Apple ‘changes the world’ would lead to a discussion of what ‘changing the world’ actually means, which would lead to obscure philosophy, steering the conversation away from the actual point.
‘Changing the world’ is an effective marketing tool that’s useful for building the feeling of consensus. Steve Jobs used it heavily, as had endless numbers of businesses, conferences, nonprofits, and TV shows. It’s used because it sounds good and is typically not questioned, so I’m here to question it. I believe that the popularization of this phrase creates confused goals and perverse incentives from people who believe they are doing good things.
Problem 1: 'Changing the World' Leads to Television Value over Real Value
It leads nonprofit workers to passionately chase feeble things. I’m amazed by the variety that I see in people who try to ‘change the world’. Some grow organic food, some research rocks, some play instruments. They do basically everything.
Few people protest this variety. There are millions of voices making the appeal to ‘change the world’ in ways that would validate many radically diverse pursuits.
TED, the modern symbol of the intellectual elite for many, is itself a grab bag of ways to ‘change the world’, without any sense of scale between pursuits. People tell comedic stories, sing songs, discuss tales of personal adventures and so on. In TED Talks, all presentations are shown side-by-side with the same lighting and display. Yet in real life some projects produce orders of magnitude more output than others.
At 80,000 Hours, I read many applications for career consulting. I got the sense that there are many people out there trying to live their lives in order to eventually produce a TED talk. To them, that is what ‘changing the world’ means. These are often very smart and motivated people with very high opportunity costs.
I would see an application that would express interest in either starting an orphanage in Uganda, creating a woman's movement in Ohio, or making a conservatory in Costa Rica. It was clear that they were trying to ‘change the world’ in a very vague and TED-oriented way.
I believe that ‘Changing the World’ is promoted by TED, but internally acts mostly as a Schelling point. Agreeing on the importance of ‘changing the world’ is a good way of coming to a consensus without having to decide on moral philosophy. ‘Changing the world’ is simply the minimum common denominator for what that community can agree upon. This is a useful social tool, but an unfortunate side effect is that it inspired many others to pursue this Schelling point itself. Please don’t make the purpose of your life the lowest common denominator of a specific group of existing intellectuals.
It leads businesses to gain employees and media attention without having to commit to anything. I’m living in Silicon Valley, and ‘Change the World’ is an incredibly common phrase for new and old startups. Silicon Valley (the TV show) made fun of it, as does much of the media. They should, but I think much of the time they miss the point; the problem here is not that the companies are dishonest, but that their honesty itself just doesn’t mean much. Declaring that a company is ‘changing the world’ isn’t really declaring anything.
Hiring conversations that begin and end with the motivation of ‘changing the world’ are like hiring conversations that begin and end with making ‘lots’ of money. If one couldn’t compare salaries between different companies, they would likely select poorly for salary. In terms of social benefit, most companies don’t attempt to quantify their costs and benefits on society except in very specific and positive ways for them. “Google has enabled Haiti disaster recovery” for social proof sounds to me like saying “We paid this other person $12,000 in July 2010” for salary proof. It sounds nice, but facts selected by a salesperson are simply not complete.
Problem 2: ‘Changing the World’ Creates Black and White Thinking
The idea that one wants to ‘change the world’ implies that there is such a thing as ‘changing the world’ and such a thing as ‘not changing the world’. It implies that there are ‘world changers’ and people who are not ‘world changers’. It implies that there is one group of ‘important people’ out there and then a lot of ‘useless’ others.
This directly supports the ‘Great Man’ theory, a 19th century idea that history and future actions are led by a small number of ‘great men’. There’s not a lot of academic research supporting this theory, but there’s a lot of attention on it, and it’s a lot of fun to pretend it’s true.
But it’s not. There is typically a lot of unglamorous work behind every successful project or organization. Behind every Steve Jobs are thousands of very intelligent and hard-working employees and millions of smart people who have created a larger ecosystem. If one only pays attention to Steve Jobs they will leave out most of the work. They will praise Steve Jobs far too highly and disregard the importance of unglamorous labor.
Typically, much of the best work is also the most unglamorous: making WordPress websites, sorting facts into analysis, cold calling donors. Many of the best ideas for organizations may be very simple and may have been done before. However, for someone looking to get to TED conferences or become a superstar, it is very easy to overlook comparatively menial labor. This means not only that it will not get done, but that the people who do it will feel worse about themselves.
So some people do important work and feel bad because it doesn’t meet the TED standard of ‘change the world’. Others try ridiculously ambitious things outside their own capabilities, fail, and then give up. Others don’t even try, because their perceived threshold is too high for them. The very idea of a threshold and a ‘change or don’t change the world’ approach is simply false, and believing something that’s both false and fundamentally important is really bad.
In all likelihood, you will not make the next billion-dollar nonprofit. You will not make the next billion-dollar business. You will not become the next congressperson in your district. This does not mean that you have not done a good job, and it should not demoralize you if you fail to do these things.
Finally, I would like to ponder on what happens once or if one does decide they have changed the world. What now? Should one change it again?
It’s not obvious. Many retire or settle down after feeling accomplished. However, this is exactly when trying is the most important. People with the best histories have the best potential. No matter how much a U.S. President may achieve, they can still achieve significantly more after the end of their term. There is no ‘enough’ line for human accomplishment.
Conclusion
In summary, the phrase ‘change the world’ provides no clear direction and encourages black-and-white thinking that distorts behaviors and motivation. However, I do believe the phrase can act as a stepping stone towards a more concrete goal: an idea that requires a philosophical continuation. It’s a start for a goal, but it should be recognized that it’s far from a good ending.
Next time someone tells you about ‘changing the world’, ask them to follow through with telling you the specifics of what they mean. Make sure that they understand that they need to go further in order to mean anything.
And more importantly, do this for yourself. Choose a specific axiomatic philosophy or set of philosophies and aim towards those. Your ultimate goal in life is too important to be based on an empty marketing term.
Cosmic expansion vs uploads economics?
In a previous post (and the attendant paper and talks) I mentioned how easy it is to build a Dyson sphere around the sun (and start universal colonisation), given decent automation.
Decent automation includes, of course, the copyable uploads that form the basis of Robin Hanson's upload economics model. If uploads can gather vast new resources by Dysoning the sun using current or near future technology, this calls into question Robin's model that standard current economic assumptions can be extended to an uploads world.
And Dysoning the sun is just one way uploads could be completely transformative. There are certainly other ways, which we cannot yet begin to imagine, in which uploads could radically transform human society in short order, making all our continuity assumptions and our current models moot. It would be worth investigating these ways, keeping in mind that we will likely miss some important ones.
Against this, though, is the general unforeseen friction argument. Uploads may be radically transformative, but probably on longer timescales than we'd expect.
Minor, perspective changing facts
There's a lot of background mess in our mental pictures of the world. We try to be accurate on important issues, but a whole lot of the less important stuff we pick up from the media, the movies, and random impressions. And once these impressions are in our mental pictures, they just don't go away - until we find a fact that causes us to say "huh", and reassess.
Here are three facts that have caused that "huh" in me, recently, and completely rearranged minor parts of my mental map. I'm sharing them here, because that experience is a valuable one.
- Think terrorist attack on Israel - did the phrase "suicide bombing" spring to mind? If so, you're so out of fashion: the last suicide bombing in Israel was in 2008 - a year where dedicated suicide bombers managed the feat of killing a grand total of 1 victim. Suicide bombings haven't happened in Israel for over half a decade.
- Large scale plane crashes seem to happen all the time, all over the world. They must happen at least a few times a year, in every major country, right? Well, if I'm reading this page right, the last time there was an airline crash in the USA that killed more than 50 people was... in 2001 (2 months after 9/11). Nothing on that scale since then. And though there have been crashes en route to/from Spain and France since then, it seems that a major air crash in a western country is something that essentially never happens.
- The major cost of a rocket isn't the fuel, as I'd always thought. It seems that the Falcon 9 rocket costs $54 million per launch, of which fuel is only $0.2 million (or, as I prefer to think of it: I could sell my house to get enough fuel to fly to space). In the difference between those two prices lies the potential for private spaceflight to low-Earth orbit.
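The quoted figures make the point starkly:

```python
# Fuel as a fraction of the Falcon 9 launch price, from the figures quoted above.
launch_cost_musd = 54.0   # quoted launch price, millions of USD
fuel_cost_musd = 0.2      # quoted fuel cost, millions of USD

fuel_fraction = fuel_cost_musd / launch_cost_musd
print(f"{fuel_fraction:.2%}")  # 0.37% of the launch price is fuel
```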
[LINK] Climate change and food security
A Guardian article on the impact of climate change on food security. This is worrying (albeit perhaps not a global catastrophic (or existential) risk). It has the potential to wipe out the gains made against extreme poverty in the last few decades.
Should we be so pessimistic? Climate change might be averted through government action or a technological fix; or the poorest might get rich enough to be protected from this insecurity; or we could see a second 'Green Revolution' with GM, etc. I've also seen some discussion that climate change could in fact increase food cultivation - in Russia and Canada for example.
How do people feel about this - optimistic or pessimistic?
How to un-kill your mind - maybe.
It has been the case since I had opinions on these things that I have struggled to identify my “favourite writer of all time”. I've thought perhaps it was Shakespeare, as everyone does – who composed over thirty plays in his lifetime, from any of which a single line would be so far beyond my ability as to make me laughable. Other times I've thought it may be Saul Bellow, who seems to understand human nature in an intuitive way I can't quite reach, but which always touches me when I read his books. And more often than not I've thought it was Raymond Chandler, who in each of his seven novels broke my heart and refused to apologise – because he knew what kind of universe we live in. But since perhaps the year 2007, I have, or should I say had, not been in the slightest doubt as to who my favourite living writer was – Christopher Eric Hitchens.
This post is not about how much I admired him. It's not about how surprisingly upset I was about his death (I have since said that I didn't know him except through his writing – a proposition something like “I didn't have sex with her except through her vagina”) - although I must say that even now thinking about this subject is having rather more of an effect on me than I would like. This post is about a rather strange change that has come over me since his death on the 15th of December. Before that time I was a staunch defender of the proposition that the removal of Saddam Hussein from power in Iraq was an obvious boon to the human race, and that the war in Iraq was therefore a wise and moral undertaking. Since then, however, I have found my opinion softening on the subject – I have found myself far more open to cost/benefit analyses that have come down on the side of non-intervention, and much less indignant when others disagreed. It still seems to me that there are obvious benefits that have arisen from the war in Iraq – by no means am I willing to admit that it was an utter catastrophe, as so many seem convinced it was – but I have found my opinion shifting toward the non-committal middle ground of “I dunno”.
Well, Mrs. Mason didn't raise all that many fools. It could be that what's happening here is that I'm identifying closely with the Ron Paul campaign: I agree with Paul on many things but not on American foreign policy (and, as it happens, I'm British – but consider myself internationalist enough that American arguments significantly influence my views), and so am shifting towards his point of view. But I think it's rather more likely – embarrassing as this is to admit – that the sheer fact that the Hitch could no longer possibly be my friend – could no longer congratulate me on my enlightened point of view, or go into coalition with me against the forces of irrationality – has freed up my opinions on the Iraq war, and I have dropped into the centre-ground of “Not enough information”. This, as I said, is embarrassing – whether or not the best writer in the world approves of your opinion is no basis for sticking to it. But this is the position I find myself in: weak; fragile; irrational – at least as far as politics go.
So here is my half-way solution: extreme and not perfect, by any means, but, given the unearthing of this appalling weakness, I think necessary. From this point onwards, until January 1st 2013 (yes, an arbitrary point in the future), I am not allowed to settle on a political or moral opinion (ethics – the question of what constitutes the good life – I consider comparatively easy, and so exempt). Even when presented with apparently knock-down arguments, I am forbidden from professing allegiance to any moral or political position for the rest of the year. Yes, it is going to be hard to prevent myself from deciding on moral or political questions – but I am hoping that if I can at least prevent myself from defending any position for the rest of the year, I will, at the end of it, no longer be emotionally attached to any particular ideology, and be able to assess the questions at least semi-rationally. I don't want to believe anything just because Hitchens believed it. I don't want to be motivated by perceived-but-illusory friendship. I want the right answer. And I'm hoping that by depriving my brain of the reinforcement that comes from being part of a team – no matter how small – I will be able to consider the matter rationally.
Until 2013, then, this is it for me. No longer are Marxism, fascism, anarcho-syndicalism etc. incorrect. They're interesting ideas, and I'd like to hear more about them. This is my slightly-less-than-a-year off from ideology. Let's hope that it works.