Counterfactual do-what-I-mean
A putative new idea for AI control; index here.
The counterfactual approach to value learning could plausibly allow AIs to be given natural language goals.
The basic idea is that when the AI is given a natural language goal like "increase human happiness" or "implement CEV", it is not to figure out what these goals mean, but to follow what a pure learning algorithm would establish these goals as meaning.
This would be safer than a simple figure-out-the-utility-you're-currently-maximising approach, but it still has a few drawbacks. Firstly, the learning algorithm has to be effective itself: in particular, modifying human understanding of the words should be ruled out, and the learning process must avoid concluding that simpler interpretations are always better. Secondly, humans don't yet know what these words mean outside our usual comfort zone, so the "learning" task also involves the AI extrapolating beyond what we know.
Internal Race Conditions
Time start: 14:40:36
I
You might be familiar with the concept of a 'bug', as introduced by CFAR. By using the computer programming analogy, it frames any problem you might have in your life as something fixable; more than that, as something to be fixed, something where fixing it, or thinking about how to fix it, is the first thing that comes to mind when you encounter such a problem, or 'bug'.
Let's try another analogy in the same style, with something called 'race conditions' in programming. A race condition is a particular type of bug that is typically very hard to find and fix ('debug'). It occurs when two or more parts of the same program 'race' to access some data, resource, decision point etc., in a way that is not controlled by any organising principle.
For example, imagine that you have a document open in an editor program. You make some changes and give the command to save the file. While this operation is in progress, you drag and drop the same file in a file manager, moving it to another hard drive. In this case, depending on timing, on the details of the programs, and on the operating system you are using, you might get different results. The old version of the file might be moved to the new location while the new one is saved in the old location. Or the file might get saved first and then moved. Or the saving operation might end in an error, or in a truncated or otherwise malformed file on disk.
If you knew enough details about the situation, you could in fact work out exactly what would happen. But the margin of error in your own handling of the software is so big that in practice you cannot do this (e.g. you'd need to know the exact millisecond at which you press each button). So in practice the outcome is random, depending on how the events play out on a scale smaller than you can directly control (e.g. minute differences in timing, strength of reactions etc.).
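In code, the classic minimal demonstration of a race condition is two threads doing an unsynchronized read-modify-write on shared data. This sketch (mine, not from the original post) shows the racy version and the structured fix with a lock:

```python
import threading

def unsafe_increment(counter, n):
    # Read-modify-write with no synchronization: two threads can both read
    # the same old value, and one of the two updates is silently lost.
    for _ in range(n):
        counter["value"] = counter["value"] + 1

def safe_increment(counter, n, lock):
    # The lock imposes an organising principle: only one thread at a time
    # may perform the read-modify-write, so no update can be lost.
    for _ in range(n):
        with lock:
            counter["value"] = counter["value"] + 1

def run_two_threads(worker, *args):
    threads = [threading.Thread(target=worker, args=args) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

unsafe = {"value": 0}
run_two_threads(unsafe_increment, unsafe, 100_000)
print("unsafe:", unsafe["value"])  # can be less than 200000: updates lost to the race

lock = threading.Lock()
safe = {"value": 0}
run_two_threads(safe_increment, safe, 100_000, lock)
print("safe:", safe["value"])  # always 200000
```

As in the file-saving example, the unsafe outcome depends on scheduling details far below the level you directly control, which is exactly what makes such bugs hard to reproduce and debug.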
II
What is the analogy in humans? One place where, if you look hard, you'll see this pattern a lot is the relationship between emotions and conscious decision-making.
E.g., a classic failure mode is a "commitment to emotions", which goes like this:
- I promise to love you forever
- however if I commit to this, I will have doubts and less freedom, which will generate negative emotions
- so I'll attempt to fall in love faster than my doubts grow
- let's do this anyway, why wouldn't we?
The problem here is a typical emotional "race condition": there is a lot of variability in the outcome, depending on how events play out. There could be a "butterfly effect", in which e.g. a single weekend trip together could determine the fate of the relationship, by creating a swing up or down, which would give one side of emotions a head start in the race.
III
Another typical example is making a decision about continuing a relationship:
- when I spend time with you, I like you more
- when I like you more, I want to continue our relationship
- when we have a relationship, I spend more time with you
As you can see, there is a loop in the decision process. This cannot possibly end well.
A wild emotional rollercoaster is probably around the least bad outcome of this setup.
IV
So how do you fix race conditions?
By creating structure.
By following principles which compute the result explicitly, without unwanted chaotic behaviour.
By removing loops from decision graphs.
First and foremost, by recognizing that leaving a decision to a race condition is strictly worse than any decision process we consciously design, even if that process is flipping a coin (at least then you know the odds!).
Example: deciding to continue the relationship.
Proposed solution (arrows represent influence):
(1) controlled, long-distance emotional evaluation -> (2) systemic decision -> (3) day-to-day emotions
The idea is to remove the loop by organising emotions into two groups: those that are directly influenced by the decision or its consequences (3), and the more distant "evaluation" emotions (1). The opportunity to feel the emotions in (1) can be created by pre-deciding on a time to be alone and judge the situation from more distance, e.g. "after 6 months of this relationship I will go for a 2-week vacation to my aunt in France, and think about it in a clear-headed way, making sure I consider emotions about the general picture, not day-to-day things like physical affection etc.".
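The difference between the two setups can be checked mechanically: represent each influence graph as an adjacency map and test for cycles. This is a small illustrative sketch (my own; the node names are just labels for the structures above):

```python
def has_cycle(graph):
    """Detect a cycle in a directed graph via depth-first search."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited / on current path / done
    color = {node: WHITE for node in graph}

    def visit(node):
        color[node] = GRAY
        for succ in graph.get(node, []):
            if color[succ] == GRAY:  # back edge: we returned to the current path
                return True
            if color[succ] == WHITE and visit(succ):
                return True
        color[node] = BLACK
        return False

    return any(visit(n) for n in graph if color[n] == WHITE)

# The loop from section III: time -> liking -> relationship -> time
looped = {
    "spend time together": ["like you more"],
    "like you more": ["continue relationship"],
    "continue relationship": ["spend time together"],
}

# The proposed structure: evaluation -> decision -> day-to-day emotions
structured = {
    "long-distance evaluation": ["systemic decision"],
    "systemic decision": ["day-to-day emotions"],
    "day-to-day emotions": [],
}

print(has_cycle(looped))      # True: the outcome is left to a race
print(has_cycle(structured))  # False: influence flows one way
```

The point of the proposed structure is precisely that it is a directed acyclic graph: the evaluation feeds the decision, the decision shapes day-to-day emotions, and nothing feeds back into the evaluation except at the pre-decided checkpoint.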
V
There is much more to write on this topic, so please excuse my brevity (especially in the last part, which gives only a few examples of systemic thinking about emotions); there is easily enough material here to fill a book (or two). But I hope I gave you some idea.
Time end: 15:15:42
Writing stats: 31 minutes, 23 wpm, 133 cpm
New LW Meetup: Zurich
This summary was posted to LW Main on October 21st. The following week's summary is here.
New meetups (or meetups with a hiatus of more than a year) are happening in:
Irregularly scheduled Less Wrong meetups are taking place in:
- Munich Meetup in October: 29 October 2016 04:00PM
- Stockholm: Mental contrasting: 21 October 2016 04:00PM
The remaining meetups take place in cities with regular scheduling, but involve a change in time or location, special meeting content, or simply a helpful reminder about the meetup:
- Bay Area Winter Solstice 2016: 17 December 2016 07:00PM
- [Moscow] Games in Kocherga club: FallacyMania, Tower of Chaos, Scientific Discovery: 26 October 2016 07:40PM
- NY Solstice 2016 - The Story of Smallpox: 17 December 2016 06:00PM
- San Francisco Meetup: Stories: 24 October 2016 06:15PM
- Washington, D.C.: Technology of Communication: 23 October 2016 03:30PM
Locations with regularly scheduled meetups: Austin, Berlin, Boston, Brussels, Buffalo, Canberra, Columbus, Denver, Kraków, London, Madison WI, Melbourne, Moscow, New Hampshire, New York, Philadelphia, Research Triangle NC, San Francisco Bay Area, Seattle, St. Petersburg, Sydney, Tel Aviv, Toronto, Vienna, Washington DC, and West Los Angeles. There's also a 24/7 online study hall for coworking LWers and a Slack channel for daily discussion and online meetups on Sunday night US time.
Weekly LW Meetups
This summary was posted to LW Main on September 30th. The following week's summary is here.
Irregularly scheduled Less Wrong meetups are taking place in:
The remaining meetups take place in cities with regular scheduling, but involve a change in time or location, special meeting content, or simply a helpful reminder about the meetup:
- Bay Area Winter Solstice 2016: 17 December 2016 07:00PM
- Melbourne: A Bayesian Guide on How to Read a Scientific Paper: 08 October 2016 03:30PM
- Sydney Rationality Dojo - October 2016: 02 October 2016 04:00PM
Locations with regularly scheduled meetups: Austin, Berlin, Boston, Brussels, Buffalo, Canberra, Columbus, Denver, Kraków, London, Madison WI, Melbourne, Moscow, New Hampshire, New York, Philadelphia, Research Triangle NC, San Francisco Bay Area, Seattle, Sydney, Tel Aviv, Toronto, Vienna, Washington DC, and West Los Angeles. There's also a 24/7 online study hall for coworking LWers and a Slack channel for daily discussion and online meetups on Sunday night US time.
Philosophical theory with an empirical prediction
I have a philosophical theory which implies some things empirically about quantum physics, and I was wondering if anyone knowledgeable on the subject could give me some insight.
It goes something like this:
Anathema to reductionists as this may be: quarks (and by "quarks" I just mean whatever the fundamental particles of the universe are) are not governed by simple rules à la Conway's Game of Life; rather, all of metaphysics goes into their behavior.
The reductionist basically reduces metaphysics to the simple rules that govern quarks. Fundamentally there is no other identity or causality; everything else is just emergent from that. Anything we want to call "real" in ordinary experience has no metaphysical identity or causal efficacy of its own; it's just an illusion produced by tons of atoms bouncing around. If the universe is akin to Conway's Game of Life, then I don't think the things we see around us are actually what we think they are. They have no real identity on a metaphysical level; rather, they are just patterns of particles in motion, governed by mathematically simple rules.
But suppose there actually is metaphysical identity and causal power in the things around us. The place I can see for that is in the unknown rules governing quarks: these are not mathematically simple rules; rather, that is literally where all of metaphysics is contained. Quarks entangle together according to high-level concepts corresponding to the things we see around us, including a person's identity, and have not the mathematically simple causal powers of Conway's Game of Life, but the causal powers of the identity of the high-level agent.
The empirical question is this: do we observe the fundamental particles of the universe behaving according to mathematically simple rules, or do they seem to behave in complex or unpredictable ways depending on how they are entangled and what they are interacting with?
Adding an example to clarify:
The behavior of the quarks corresponds to the identity of the things we see around us. The things we see around us are constituted by quarks - but the question is, are these quarks behaving mindlessly as billiard balls, or is their behavior the result of complex rules corresponding to the identity of the thing they form?
In other words, suppose we're talking about a living ant, are the quarks which constitute that ant behaving according to simple mathematical rules like billiard balls, and the whole concept of there being an "ant" is just an illusion produced by these particles bouncing around, or are these quarks constituting the ant actually behaving "ant-like"?
Is the causal behavior of the ant determined by the billiard-ball interactions of quarks bouncing around, or does the causal behavior actually originate in the identity of the ant, with the quark interactions being decided according to its nature?
What I'm saying is that there metaphysically is such a thing as an ant: when quarks "get together as an ant", they behave differently; they behave ant-like. Given how much is unknown about exactly why quarks behave the way they do, why is this ruled out: that when they "get together as an ant", they behave ant-like?
Basically the idea is, when it comes to the interactions of the quarks constituting the ant with the quarks constituting the things the ant interacts with, the behavior of those interactions is determined not by simple, universal rules of quark behavior, but by the rules of quark behavior that are in effect "when the quarks are an ant".
To further clarify this example:
This is framed in general terms, because I don't actually know any quantum physics, but I'm talking about the fundamental physical particles ("quarks", for lack of a better term), and their behavior at the quantum level - behavior which we don't fully understand. So one could say in general terms, sometimes the quarks "swerve left" and other times they "swerve right", and we don't exactly know why they do that in any given case.
So the question is: suppose the behavior of quarks in general is not determined by simple, universal laws of quark behavior (e.g. "always swerve left 50% of the time"), but rather there are metaphysically real and physically meaningful "quark groups". If a bunch of quarks are entangled together in a group constituting what we'd observe to be an ant, then quarks in that quark group behave differently. So for example, the quarks in that "ant quark group" might always swerve left when they interact with another quark group of a different kind.
Weekly LW Meetups
New meetups (or meetups with a hiatus of more than a year) are happening in:
Irregularly scheduled Less Wrong meetups are taking place in:
- Munich Meetup in October: 29 October 2016 04:00PM
- Stockholm: Bottlenecks to trading personal resources: 11 November 2016 05:15PM
The remaining meetups take place in cities with regular scheduling, but involve a change in time or location, special meeting content, or simply a helpful reminder about the meetup:
- Bay Area Winter Solstice 2016: 17 December 2016 07:00PM
- Moscow: rational review, status quo bias, interpersonal closeness: 30 October 2016 02:00PM
- NY Solstice 2016 - The Story of Smallpox: 17 December 2016 06:00PM
- San Francisco Meetup: Board Games: 31 October 2016 06:15PM
- Washington, D.C.: Halloween Party: 30 October 2016 03:00PM
Locations with regularly scheduled meetups: Austin, Berlin, Boston, Brussels, Buffalo, Canberra, Columbus, Denver, Kraków, London, Madison WI, Melbourne, Moscow, New Hampshire, New York, Philadelphia, Research Triangle NC, San Francisco Bay Area, Seattle, St. Petersburg, Sydney, Tel Aviv, Toronto, Vienna, Washington DC, and West Los Angeles. There's also a 24/7 online study hall for coworking LWers and a Slack channel for daily discussion and online meetups on Sunday night US time.
Your Truth Is Not My Truth
Can someone help me dissolve this, and give insight into how to proceed with someone who says this?
What are they saying, exactly? That the set of beliefs in their head that they use to make decisions is not the same set of beliefs that you use to make decisions?
Could I say something like "Yes, that's so, but how do you know that your truth matches what is in the real world? Is there some way to know that your truth isn't only true for you, and not actually true for everybody?"
I'm trying to get a feel for what they mean by "true" in this case, since it's obviously not "matching reality."
Trying to find a short story
It's a story about a boy who is into science and transhumanism, and a girl he told about all the crazy things that were going to happen. He dies, and all of the things he said start to happen. She ends up floating around Saturn, remembering him.
Either he or she was in a wheelchair. He was dying, and he was disappointed about it because of all the cool stuff that was going to happen that she was going to be around for; some of it had to do with whatever problem she had, which was going to get fixed.
Please help me find this story if you can.
Weekly LW Meetups
This summary was posted to LW Main on October 14th. The following week's summary is here.
Irregularly scheduled Less Wrong meetups are taking place in:
The remaining meetups take place in cities with regular scheduling, but involve a change in time or location, special meeting content, or simply a helpful reminder about the meetup:
- Bay Area Winter Solstice 2016: 17 December 2016 07:00PM
- San Francisco Meetup: Rationality Diary: 17 October 2016 06:15PM
- Washington, D.C.: Fun & Games: 16 October 2016 03:30PM
Locations with regularly scheduled meetups: Austin, Berlin, Boston, Brussels, Buffalo, Canberra, Columbus, Denver, Kraków, London, Madison WI, Melbourne, Moscow, New Hampshire, New York, Philadelphia, Research Triangle NC, San Francisco Bay Area, Seattle, St. Petersburg, Sydney, Tel Aviv, Toronto, Vienna, Washington DC, and West Los Angeles. There's also a 24/7 online study hall for coworking LWers and a Slack channel for daily discussion and online meetups on Sunday night US time.
Weekly LW Meetups
This summary was posted to LW Main on October 7th. The following week's summary is here.
Irregularly scheduled Less Wrong meetups are taking place in:
The remaining meetups take place in cities with regular scheduling, but involve a change in time or location, special meeting content, or simply a helpful reminder about the meetup:
- Baltimore Area / UMBC Weekly Meetup: 09 October 2016 08:00PM
- Bay Area Winter Solstice 2016: 17 December 2016 07:00PM
- Melbourne: A Bayesian Guide on How to Read a Scientific Paper: 08 October 2016 03:30PM
- Moscow: rational review, bias busters, Kolmogorov and Jayes probability: 09 October 2016 02:00PM
- Washington, D.C.: Games Discussion: 09 October 2016 03:30PM
Locations with regularly scheduled meetups: Austin, Berlin, Boston, Brussels, Buffalo, Canberra, Columbus, Denver, Kraków, London, Madison WI, Melbourne, Moscow, New Hampshire, New York, Philadelphia, Research Triangle NC, San Francisco Bay Area, Seattle, Sydney, Tel Aviv, Toronto, Vienna, Washington DC, and West Los Angeles. There's also a 24/7 online study hall for coworking LWers and a Slack channel for daily discussion and online meetups on Sunday night US time.
Risk Contracts: A Crackpot Idea to Save the World
Time start: 18:17:30
I
This idea is probably going to sound pretty crazy. As seemingly crazy ideas go, it's high up there. But I think it is interesting enough to at least amuse you for a moment, and upon consideration your impression might change. (Maybe.) As a bonus, it offers some insight into AI problems, if you are into that.
(This insight into AI may or may not be new. I am not an expert on AI theory, so I wouldn't know. It's elementary, so probably not new.)
So here it is, in a short form that I will expand on in a moment:
Global risks to humanity can be captured in "risk contracts", freely tradeable on the market. Risk contracts would serve the same role as CO2 emissions contracts, which can likewise be traded and which ensure that the global norm is not exceeded as long as everyone plays by the rules.
So e.g. if I want to run a dangerous experiment that might destroy the world, it's totally OK as long as I can purchase enough of a risk budget. Pretty crazy, isn't it?
As an added bonus, a risk contract can take into account the risk of someone else breaking its terms. When you transfer your rights to global risk, the contract obliges you to diminish the amount you transfer by the uncertainty about the other party being able to fulfill all the obligations that come with such a contract. And if you don't have enough risk budget for this, you cannot transfer to that person.
II
Let's go into a little more detail about a risk contract. Note that this is supposed to illustrate the idea, not be the final say on the shape and terms of such a contract.
Just to give you some idea, here are some example rules (with lots of room to specify them more precisely; this is really just so that you have a clearer picture of what I mean by a "risk contract"):
- My initial risk budget is 5 * 10^-12 chance of destroying the world. I am going to track this budget and do everything in my power to make sure that it never goes below 0.
- For every action (or set of correlated actions) I take, I will subtract the probability that those actions destroy the world from my budget (using simple subtraction unless the correlation between actions is very high).
- If I transfer my budget to an agent who is going to decide about its actions independently from me, I will first pay the cost from my budget for the probability that this agent might not keep the terms of the contract. I will use my best conservative estimates, and refuse the transaction if I cannot keep the risk within my budget.
- Any event in which a risk contract on world destruction is breached will draw on my budget as if it were an actual destruction of the world.
- Whenever I create a new intelligent agent, I will transfer some risk budget to that agent, according to the rules above.
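The rules above can be sketched as a tiny ledger. This is only an illustration of the bookkeeping (the class, its methods, and all numbers besides the 5 * 10^-12 initial budget are my own assumptions, not part of the proposal):

```python
class RiskContract:
    """Toy ledger for the example risk-contract rules (illustrative only)."""

    def __init__(self, budget=5e-12):
        self.budget = budget  # remaining allowed probability of destroying the world

    def take_action(self, destruction_probability):
        # Rule 2: subtract each action's destruction probability from the budget,
        # refusing any action that would push the budget below zero (rule 1).
        if destruction_probability > self.budget:
            raise ValueError("action refused: would exceed risk budget")
        self.budget -= destruction_probability

    def transfer(self, amount, counterparty_breach_probability):
        # Rule 3: pay from my own budget for the chance that the recipient
        # breaks the contract, since rule 4 treats a breach as equivalent
        # to actually destroying the world.
        cost = amount + counterparty_breach_probability
        if cost > self.budget:
            raise ValueError("transfer refused: cannot cover breach risk")
        self.budget -= cost
        return RiskContract(budget=amount)  # rule 5: the new agent's budget

parent = RiskContract()
child = parent.transfer(amount=1e-12, counterparty_breach_probability=1e-13)
print(child.budget)   # 1e-12
print(parent.budget)  # 5e-12 - (1e-12 + 1e-13), approximately 3.9e-12
```

Note how the breach probability is charged to the transferring party up front, which is what makes the scheme conservative: uncertainty about the counterparty is paid for as if it were risk already realised.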
III
Of course, the application of this could be wider than just an AI that might recursively self-improve; some more "normal" human applications could be risk management in a company or government, or even using risk contracts as an internal currency to make better decisions.
I admit, though, that the AI case is pretty special: it gives us the opportunity to actually control the ability of another agent to keep the risk contract we are giving to it.
It is an interesting calculation to see, under a lot of simplifying assumptions, roughly what the costs of keeping a risk contract are in the recursive-AI case. Assume that the risk of a child AI going off the rails can be reduced by a constant factor (e.g. cut in half) for each additional unit of safety work. Also assume the chain of child AIs might continue indefinitely, and no later AI will assume it ends after finitely many steps. Then, if the chain has no branches, we are basically reduced to a power series: the risk budget of a child AI is always the same fraction of its parent's budget. That means we need a linearly increasing amount of safety work at each step, which in turn means that the total amount of safety work is quadratic in the number of steps (child AIs).
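A back-of-envelope version of this calculation can be written out directly. The specific numbers (the baseline off-the-rails risk, the halving factor, the budget fraction) are assumptions I have picked for illustration; only the 5 * 10^-12 root budget comes from the example contract:

```python
import math

B0 = 5e-12          # root AI's risk budget (from the example contract)
r = 0.5             # assumed: each child gets this fraction of its parent's budget
p_baseline = 1e-6   # assumed: risk of a child going off the rails with no safety work
# assumed: one unit of safety work halves the risk, so meeting a target risk
# costs log2(p_baseline / target) units of work

def safety_work(step):
    """Units of work to push child #step's risk down to its budget B0 * r**step."""
    target = B0 * r ** step
    return max(0.0, math.log2(p_baseline / target))

per_step = [safety_work(n) for n in range(1, 6)]
totals = [sum(per_step[:k]) for k in range(1, 6)]
print(per_step)  # grows by a constant amount each step: linear in the step number
print(totals)    # partial sums of a linear sequence: quadratic in the number of steps
```

Since the budget shrinks geometrically (a power series) while each halving costs one unit of work, the work per step is the logarithm of a geometric sequence, hence linear, and the running total is therefore quadratic, matching the argument above.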
Time end: 18:52:01
Writing stats: 21 wpm, 115 cpm (previous: 30/167, 33/183, 23/128)