## ALBA: can you be "aligned" at increased "capacity"?

3 13 April 2017 07:23PM

Crossposted at the Intelligent Agents Forum.

I think that Paul Christiano's ALBA proposal is good in practice, but has conceptual problems in principle.

Specifically, I don't think it makes sense to talk about bootstrapping an "aligned" agent to one that is still "aligned" but that has an increased capacity.

The main reason being that I don't see "aligned" as being a definition that makes sense distinct from capacity.

## These are not the lands of your forefathers

Here's a simple example: let r be a reward function that is perfectly aligned with human happiness within ordinary circumstances (and within a few un-ordinary circumstances that humans can think up).

Then the initial agent - B0, a human - trains a reward r1 for an agent A1. This agent is limited in some way - maybe it doesn't have much speed or time - but the aim is for r1 to ensure that A1 is aligned with B0.

Then the capacity of A1 is increased, yielding B1, a slow but powerful agent. It computes the reward r2 to ensure the alignment of A2, and so on.

The nature of the Bj agents is not defined - they might be algorithms calling Ai for i ≤ j as subroutines, humans may be involved, and so on.

If the humans are unimaginative and don't deliberately seek out more extreme and exotic test cases, the best case scenario is for ri → r as i → ∞.

And eventually there will be an agent An that is powerful enough to overwhelm the whole system and take over. It will do this in full agreement with Bn-1, because they share the same objective. And then An will push the world into extra-ordinary circumstances and proceed to maximise r, with likely disastrous results for us humans.
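A toy numerical sketch of this failure mode (the value functions and numbers are invented for illustration, not part of the ALBA proposal): the proxy reward r matches human value exactly on "ordinary" states, so a limited agent optimising r is aligned, yet a powerful agent searching extraordinary states maximises r precisely where it diverges from what we want.

```python
def human_value(x):
    # True human happiness: peaks at a moderate state, crashes in extreme worlds.
    return 1.0 - abs(x - 1) if abs(x) <= 10 else -100.0

def proxy_reward(x):
    # r: matches human value perfectly on ordinary states (|x| <= 10),
    # but keeps rewarding "more x" in extraordinary states.
    return 1.0 - abs(x - 1) if abs(x) <= 10 else float(x)

# A weak agent A1 can only search ordinary states: alignment holds.
weak_choice = max(range(-10, 11), key=proxy_reward)

# A powerful agent An searches the whole space: it pushes the world
# into extraordinary states where r and human value come apart.
strong_choice = max(range(-1000, 1001), key=proxy_reward)

print(weak_choice, human_value(weak_choice))      # 1 1.0  (aligned)
print(strong_choice, human_value(strong_choice))  # 1000 -100.0  (disaster)
```

No goal drift occurs anywhere in this sketch: the reward being optimised is identical in both cases, and only the search capacity changes.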

## The nature of the problem

So what went wrong? At what point did the agents go out of alignment?

In one sense, at An. In another sense, at A1 (and, in another interesting sense, at B0, the human). The reward r was aligned, as long as the agent stayed near the bounds of the ordinary. As soon as it was no longer restricted to that, it went out of alignment, not because of a goal drift, but because of a capacity increase.

## Preference over null preference

-7 05 September 2016 12:47PM

Original post:  http://bearlamp.com.au/preference-over-null-preference/

For some parts of life it is better to exist with a prepared preference; for other parts of life it is better to exist without one.  This post looks at these sets of preferences and asks which might be better in each scenario.

On the object level some examples:

• I like blue hair
• I don't like the colour red
• I like to eat Chinese food
• My favourite animal is a frog
• I don't like sport
• I would rather spend time in a library than a nightclub
• I love bacon icecream
• This is my favourite hat

The specific examples are irrelevant but hopefully you get the idea.  Having a preference is about having a full set and reducing it to a smaller set.  Example: one colour is my favourite out of the full set of colours.

In contrast, a null preference might look like this:

• I don't care what kind of pizza we eat
• I eat anything
• I just like being with friends, it doesn't matter what we do
• I've never really had a favourite animal
• Use your best judgement for me
• I can't decide what to wear

While null preferences are technically just another form of preference, I want to separate them out for a moment so that we can talk about them.

Deciding to hold a preference, even where you didn't previously have one, can be an effective strategy for making decisions that were previously difficult.  The benefit of having a preference is that it stands as a pre-commitment to yourself to maintain existing choices at certain choice-nodes.

The disadvantage of using this strategy is that if your preference fails to be fulfilled then you are at risk of disappointment.

If you get to the supermarket and can't decide which type of jam (or jelly, for Americans) to buy, you can consult an existing preference for blueberry jam and skip the whole business of considering other types of jam.

As for preference failure: if you prefer blueberry but the store doesn't have any, you risk leaving without jam, or settling for a less desirable choice, like strawberry.

Nothing about your preference affects the world until you interact with it.  For the jam example: deciding you like blueberry will not cause blueberry jam to be available to you.  Modifying the map does not immediately impact the territory.  If you go on an endless tirade everywhere you travel, demanding blueberry jam, chances are that at some point people will be motivated to be prepared for the mad blueberry jammer.  But at the instant you decide to have a preference, you have not yet caused any impact on the world.  So the instantaneous effect of having a preference is nothing.

There will be times where you are not in control of the choice nodes available to you.  There are times when you will be.  That doesn't seem relevant to which is better.

There will be times when there will be more choices and times when there will be less.  That doesn't seem relevant.

There will be times when your preference will be easily fulfilled and times when it will be hard or impossible.  That doesn't seem relevant.

There are big expensive important choices, and small irrelevant insignificant choices.  Apart from desiring to spend more time on big expensive important choices, that doesn't seem relevant either.

There will be days with more willpower and days with less willpower, (and days exist on which willpower does not deplete).

## Should I change what I do right now?

I don't know.  What do you do now?  Can we derive knowledge from the pre-existing system to make progress on what should change in the future?

Try this:

• Think of 10 decisions you have made today (or recently). This might be the hard part.
• Were they all good decisions?  Do you already know what would have been better decisions in those places?  Can you identify the ones that need improvement?
• For each one, think of what metric would allow you to know you have made it better.  Examples:
  • Better outcomes as a result of your decision
  • Faster decisions
  • Decisions that take something particular into account
  • Decisions that are more in line with high-level or distant goals
  • Decisions that make you look good to your peers
  • Decisions that leave you feeling more of a certain feeling: free, safe, thrifty, skilled, powerful...
• For each decision and for each metric, ask: is a better outcome likely to come from having a preference, or from having a null preference?

As I get to this point I don't have a defining thesis for this concept.  I don't have an answer as to whether a preference can be ruled better than a null one.  Or whether you might want to reduce your identity down until you hold all null preferences, gaining flexibility and freedom (though such a reduction carries the burden of maintaining the system of nulls as well).

I do want to open up this concept to you to decide and influence the use of the idea.  If you hold a preference too strongly you risk the possibility of making the wrong choice over and over.  If you hold it too loosely you risk indifference towards the world (maybe part of the answer is to look at where you are now in terms of decision problems and consider which direction you want to travel towards).  I don't know.

• How can this concept be used?
• What comes to mind as a powerful example of preference over null preference, or nulls over active preference?
• Where do you stand now?  And if you were to move in any direction which would it be?
• Does the application of preference/null solve or reshape any existing problems in your life?

Meta: 2 hours writing, then I got stuck when I realised I didn't have a concluding thesis (didn't actually have the answers) but I wanted to release the incomplete concept anyway to see if anyone else could come up with ideas around it.

## Preference over preference

5 06 March 2016 12:51AM

Each individual person has a preference.  Some preferences are strong, others are weak.  For many preferences it's more complicated than that; they aren’t static, and we change our preferences all the time.  Some days we don't like certain foods, sometimes we may strongly dislike a certain song then another time we may not care so much. Our preferences can change in scope, as well as intensity.

Sometimes people can have preferences over other people's preferences.

• Example 1: I prefer to be surrounded by people who enjoy exercise, that way I will be motivated to exercise more.
• Example 2: I prefer to be surrounded by people who don't care how they look, that way I look prettier than everyone else.
• Example 3: I prefer when other people like my clothes.
• Example 4: I prefer my partners to be polyamorous.
• Example 5: I prefer people around me to not smoke.

The interesting thing about example 3 is that there are multiple ways to achieve that preference:

1. Find out what clothes people like and acquire those clothes, then wear them regularly.
2. Find people who already like the clothes that you have, then hang around those people regularly.
3. Change the preference of the people around you so that they like your clothes.

Changing someone’s preference over clothing seems pretty harmless, and that way you get to wear clothes you like, they get to like the clothes you wear, and you get to be around people who like the clothes you wear without finding new people. The scary and maybe uncomfortable thing is that the other preferences can be also achieved through these means.

Example 4:

1. Find out where poly people are, and hang out with them. (and ask to be their partners - etc)
2. Find out which of the people you know are already poly and hang out with them  (and ask to be their partners - etc)
3. Change the preferences of your existing partner/s.

Example 1:

1. Find out where people who enjoy exercise hang out, and join them.
2. Find out which of your friends already enjoy exercise and hang out with them.
3. Change the preferences of those around you to also enjoy exercise.

Example 5:

1. Find out where people don't smoke, hang out in those places.
2. Figure out who already doesn't smoke and hang out with them.
3. Encourage people you know to not smoke.

(I think that's enough examples)

## Is it wrong?

There is nothing inherently wrong with having a preference. Having a preference over another person’s preference is also not inherently wrong.  Such is the nature of having a preference (usually a strong one by the time you are dictating it to your surroundings).  What really matters is what you do about it.

In this day and age; no one would be discouraged from figuring out where people are not smoking and being in those places instead of the smoking places.  In this day and age you wouldn't be criticised for finding out which of your friends don't smoke and only hanging out with them either - but maybe it makes some people uncomfortable to do it, or to feel that the reciprocal might happen if someone strongly didn't like their preferences.  In this day and age; encouraging those around you to not smoke can come across as an action with questionable motives.

So let's look at some of the motives:

1. I prefer it when people don't smoke around me because then I don't get second hand smoke.
2. I prefer it when my friends don't smoke because I don't like chemical dependency in my environment.
3. I prefer it when my friends don't smoke so that we look better than that other group of people who do smoke.
4. I prefer it when my friends don't smoke because I don't want them to get cancer and die (and not be around to be my friends any more).

Motive 1 seems very much about self-preservation.  We can't really fault an entity for trying to self-preserve.

Motive 2 is a more broad example of self-preservation - the idea that having dependency in your environment might negatively impact you enough to warrant the need to maintain an environment without it - it's a stretch, but not an unreasonable self-preservation drive.

Motive 3 appears to be a superficial drive to be better than other people.  We often don't like admitting that this is the reason we do things; but I don't mind it either.  If it were me; I'd get pretty tired of being motivated by *keeping up with the Joneses* type attitudes but some people care greatly about that.

Motive 4 seems like a potentially altruistic desire to protect your friends; but then it seems less so once you include the bracketed sub-motive.

Herein lies the problem.  If a preference looks like it is designed to improve someone else's life like "others shouldn't smoke" (remember that "looks like to me" is equivalent to "I believe it looks like..."), and we believe that having a preference over their preference would improve their life - should we enforce that preference?  Do we have a right or even a burden to encourage those around us to quit smoking? To take up exercise?  To become poly?  To like us (or our clothes)?

The idea of preference over preference is a big one.  What if my preference is that people eat my birthday cake, and Bob's preference is that he sticks to his diet today?  Who should win?  It's my birthday.  On Bob's birthday he doesn't have to eat cake, but on my birthday he does.  Or does he?

The truth is neither way is the best way.  Sometimes hypothetical Bob should eat the birthday cake and sometimes the hypothetical birthday-kid should respect other people's dietary choices.  What we really have control over is our own preferences for ourselves.  My only advice is to tread delicately when holding preferences over other people's preferences.

If we think we know better (and we might but also might not) and are trying to uphold a preference over a preference (p/p), then what happens?

Either we are right, we are wrong, or something else happens.  What follows depends on whether the other party conformed or not (or did something else), and then on what happens when things resolve.

Examples:

1. A is smoking
2. B says not to because it's bad for you
3. A doesn't stop
4. It turns out to be bad for you
5. A gets sick

B was right, tried to push a p/p and lost.  (either by not pushing hard enough or by A being stubborn). Did the p/p serve any good here?  Should it have happened?  What if an alternative 5 exists; “A keeps smoking, never gets sick and lives to 90”.  Then was the p/p useful?

1. A is monogamous
2. B says to be poly
3. A complies
4. A is hurt

B was wrong, tried to push a p/p and won.  But B was wrong and shouldn't have pushed it.  Or maybe A shouldn't have conformed.

This can be represented in a table:

|  | B prefers to maintain the p/p | B does not maintain the p/p |
|---|---|---|
| A is susceptible to pressure | A gives in | A does not change (because there is no pressure) |
| A is not susceptible | A does not change (stubborn) | A does not change (because there is no pressure) |

And a second table of results:

|  | Change was negative (or caused a negative result) | Change was positive (or caused a positive result) |
|---|---|---|
| A is susceptible | A loses | A wins! |
| A is not susceptible | A wins! | A loses |

Assuming also that if A loses; B takes a hit as well.  Ideally we want everyone to win all the time. But just showing these things in a table is not enough.  We should be assigning estimated probability to these choices as well.

For example (my made up numbers of whether I think smoking will lead to a bad result):

Smoking:
• 98%: smoking causes problems
• 2%: smoking does not cause problems

If we edit the earlier table:

| Smoking | B prefers to maintain the p/p | B does not maintain the p/p |
|---|---|---|
| A is susceptible to pressure | A gives in (2% estimate that the change was pointless) | A does not change (no pressure; 98% estimate that this is a bad outcome) |
| A is not susceptible to pressure | A does not change (stubborn; 98% estimate that this is a bad outcome) | A does not change (no pressure; 98% estimate that this is a bad outcome) |

To a rationalist, seeing your p/p table with estimates should help in deciding whether to take you up on fulfilling your preference or not.  Assuming, of course, that rationalists never lie and can accurately estimate the confidence of their beliefs.

If you meet someone with a 98% belief they should be able to produce evidence that will also reasonably convince you of similar ideas and encourage you to update your beliefs.  So maybe in the smoking case A should be listening to B; or checking the evidence very seriously.
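The "assign estimated probabilities to the table" idea can be made concrete with a small expected-value sketch.  The utilities here are invented placeholders (good outcome = +1, bad outcome = -1), and the 98% figure is the made-up smoking estimate from above:

```python
# Illustrative numbers only: B's estimate that the change (quitting
# smoking) is actually beneficial, plus made-up utilities per outcome.
p_change_good = 0.98

def expected_value(pushes_pp, a_susceptible, u_good=1.0, u_bad=-1.0):
    """Expected outcome for A, following the table above: A changes
    only if B pushes the p/p AND A is susceptible to pressure."""
    a_changes = pushes_pp and a_susceptible
    if a_changes:
        # A quits: good with probability p_change_good.
        return p_change_good * u_good + (1 - p_change_good) * u_bad
    # A keeps smoking: good outcome only if the change was pointless.
    return p_change_good * u_bad + (1 - p_change_good) * u_good

print(expected_value(True, True))    # 0.96: pushing on a susceptible A
print(expected_value(False, True))   # -0.96: staying silent
```

With an estimate as lopsided as 98%, the numbers say push; the interesting cases are the ones closer to 50%, where the cost of being wrong about the p/p starts to dominate.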

What should you do when you hold a strong p/p that will be to your benefit at the same time as being to someone else's detriment?  (And part 2: what if you are unsure of the benefit or detriment?)

Examples:

B wants A to try a new street drug, "splice".  B says it's lots of fun and encourages A to do it.  B is unsure of the risks but sure of the benefits (lots of fun).  Should B encourage A?  (What more do we need to know to make that sort of judgement call?)

B has a specific sexual interest, and A is indifferent.  B could easily encourage A to "try this out".  Should B?

B has an old crappy car that B doesn't like very much.  B prefers to make friends with shady A's who will steal the car; then B can claim on insurance that it was stolen and get a nicer car with the payout.  Should B?

B wants A to pay for the two of them to go on a carnival ride.  The cost is simple (several dollars); the benefit is not.  Should B pressure A?  (What more do we need to know in order to answer that question?)

A always crosses the street dangerously because they are often running late.  B believes that A should be safer and walk to the nearest crossing before crossing the road; B knows that this will make A late.  Should B pressure A?  (Will more information help us answer?)

It was suggested that the Veil of Ignorance might help to create a rule in this situation.  However, the bounds of this situation dictate that you know which party you are and that you have a preference over a preference, so the Veil of Ignorance does not give us much insight here.

1. It is possible to be a selfish entity, hold a p/p, and encourage others to fulfil your preference.
2. It is also possible to be a non-influential entity and never push a preference onto others.
3. It is possible to be a stubborn entity and never conform to someone else's p/p.
4. It is also possible to be a conforming entity and always conform.

It is also possible to be a mix of these 4 in different situations and/or different preferences.

## Partial Solution

In conclusion, there are no firm rules to be drawn around p/p other than: try to understand it, understand how it can go wrong, and be careful.

Meta: 4.5 hours to write, 30mins to take feedback and edit.  Thanks to the slack for being patient while I asked tricky example questions.

## Opinions on having a stronger preference or an open preference

4 07 May 2015 09:06PM

I am wondering; and the answer seems unclear to me.

All day, every day of our lives we are presented with choice moments: what to eat, which way to turn, which clock to look at to best tell the time, what colour clothing to wear, where to walk, what to say, which words to say it with, how to respond.

Is it better to have an "established specific preference" or a more "open preference"?  Why?  Or in which places is one better than the other, and vice versa?

Some factors might include:

Mental energy: Mental energy is exerted by having to make choices regularly, which can lead to bad choices and akrasia-like habits (see decision fatigue: http://en.wikipedia.org/wiki/Decision_fatigue).  With existing preferences you can simplify indecisive moments, saving time and mental energy.

Lost opportunity: When walking well-worn pathways, you are unlikely to encounter as many new opportunities as you might have if you were exploring new experiences or trying new choices.

Establishing stability: From a scientific standpoint, establishing stable norms could allow you to better measure variations and their cause/effect on your life (i.e. food eaten and your mood).  As many of us are growing, measuring and observing the world around us and our relationship with it, perhaps it's better to establish stable choices in more areas.

I assume that once a choice is established, it would take an amount of activation energy to justify changing it, so it would be partially fixed (not to mention all the biases which would convince you that it was a good choice).

If the choice was made: Imagine if someone else made the choice for you.  Would it be easier?  Is this a good measure as to whether this choice should be pre-decided?

Would it be a productive exercise to make a list of daily choices and make pre-defined decisions for yourself, so that you don't have to make them as they come up but can instead consult an existing choice list?  Would this help with decisions?  For example, dieting should be easier when your dietary choices are already established.  Of course, in the real world decisions can change as life moves rapidly, but maybe it can help to have an existing default.

Is there a place in your life where you find having a pre-defined choice to be effective/ineffective?  One example might be a shopping list, where you can shop faster with a list of specific products rather than wander round the supermarket asking yourself, "Do I want this? Do I need this?"  Does anyone find a shopping list like this to be ineffective?

(I realised that "better" is not a useful term for defining something because there are many directions of betterness, but I leave it up to the responder to describe in which ways having a pre-made choice in certain areas is better.)  (There are several questions in this piece; feel free to answer some, all, or none.)  (I will be trying to make a poll shortly.)

--------

Possible experimental test:

1. Spend 10 minutes making a list of all the choice moments that you experienced yesterday/over the past two days, including the choices you made and possible other options.

2. Consider better choices for each of these "choice moments" (10mins)

3. Make a list of all the choices you expect to make tomorrow (spend 10 mins), and the choices you will make in each.

4. Consider alternative choices to the ones chosen (10mins), and pick a final choice that you are going to make.

5. Write a list (as a cheat-sheet) for ease of taking it with you.

6. Carry out a day of pre-chosen choices.

7. Report back.

## [link] Book review: Mindmelding: Consciousness, Neuroscience, and the Mind’s Privacy

3 29 July 2013 01:47PM

http://kajsotala.fi/2013/07/book-review-mindmelding-consciousness-neuroscience-and-the-minds-privacy/

I review William Hirstein's book Mindmelding: Consciousness, Neuroscience, and the Mind’s Privacy, in which he proposes a way of connecting the brains of two different people so that when person A has a conscious experience, person B may also have the same experience. In particular, I compare it to my and Harri Valpola's earlier paper Coalescing Minds, in which we argued that it would be possible to join the brains of two people together in such a way that they'd become a single mind.

Fortunately, it turns out that the book and the paper are actually rather nicely complementary. To briefly summarize the main differences, we intentionally skimmed over many neuroscientific details in order to establish mindmelding as a possible future trend, while Hirstein extensively covers the neuroscience but is mostly interested in mindmelding as a thought experiment. We seek to predict a possible future trend, while Hirstein seeks to argue a philosophical position: Hirstein focuses on philosophical implications while we focus on societal implications. Hirstein talks extensively about the possibility of one person perceiving another’s mental states while both remaining distinct individuals, while we mainly discuss the possibility of two distinct individuals coalescing together into one.

I expect that LW readers might be particularly interested in some of the possible implications of Hirstein's argument, which he himself didn't discuss in the book, but which I speculated on in the review:

Most obviously, if another person’s conscious states could be recorded and replayed, it would open the doors for using this as entertainment. Were it the case that you couldn’t just record and replay anyone’s conscious experience, but learning to correctly interpret the data from another brain would require time and practice, then individual method actors capable of immersing themselves in a wide variety of emotional states might become the new movie stars. Once your brain learned to interpret their conscious states, you could follow them in a wide variety of movie-equivalents, with new actors being hampered by the fact that learning to interpret the conscious states of someone who had only appeared in one or two productions wouldn’t be worth the effort. If mind uploading was available, this might give considerable power to a copy clan consisting of copies of the same actor, each participating in different productions but each having a similar enough brain that learning to interpret one’s conscious states would be enough to give access to the conscious states of all the others.

The ability to perceive various drug- or meditation-induced states of altered consciousness while still having one’s executive processes unhindered and functional would probably be fascinating for consciousness researchers and the general public alike. At the same time, the ability for anyone to experience happiness or pleasure by just replaying another person’s experience of it might finally bring wireheading within easy reach, with all the dangers associated with that.

A Hirstein-style mind meld might possibly also be used as an uploading technique. Some upload proposals suggest compiling a rich database of information about a specific person, and then later using that information to construct a virtual mind whose behavior would be consistent with the information about that person. While creating such a mind based on just behavioral data makes questionable the extent to which the new person would really be a copy of the original, the skeptical argument loses some of its force if we can also include in the data a recording of all the original’s conscious states during various points in their life. If we are able to use the data to construct a mind that would react to the same sensory inputs with the same conscious states as the original did, whose executive processes would manipulate those states in the same ways as the original, and who would take the same actions as the original did, would that mind then not essentially be the same mind as the original mind?

Hirstein’s argumentation is also relevant for our speculations concerning the evolution of mind coalescences. We spoke abstractly about the “preferences” of a mind, suggesting that it might be possible for one mind to extract the knowledge from another mind without inheriting its preferences, and noting that conflicting preferences would be one reason for two minds to avoid coalescing together. However, we did not say much about where in the brain preferences are produced, and what would actually be required for e.g. one mind to extract another’s knowledge without also acquiring its preferences. As the above discussion hopefully shows, some of our preferences are implicit in our automatic habits (the things that we show we value with our daily routines), some in the preprocessing of sensory data that our brains carry out (the things and ideas that are “painted with” positive associations or feelings), and some in the configuration of our executive processes (the actions we actually end up doing in response to novel or conflicting situations). This kind of breakdown seems like very promising material for some neuroscience-aware philosopher to tackle in an attempt to figure out just what exactly preferences are; maybe someone has already done so.

## Subjective expected utility without preferences

2 14 February 2012 03:04AM

In the latest issue of Journal of Mathematical Psychology, Denis Bouyssou and Thierry Marchant provide a model for subjective expected utility without preferences. Abstract:

This paper proposes a theory of subjective expected utility based on primitives only involving the fact that an act can be judged either ‘‘attractive’’ or ‘‘unattractive’’. We give conditions implying that there are a utility function on the set of consequences and a probability distribution on the set of states such that attractive acts have a subjective expected utility above some threshold. The numerical representation that is obtained has strong uniqueness properties.

PDF.

## Help with a (potentially Bayesian) statistics / set theory problem?

2 10 November 2011 10:30PM

Update: as it turns out, this is a voting system problem, which is a difficult but well-studied topic. Potential solutions include Ranked Pairs (complicated) and BestThing (simpler). Thanks to everyone for helping me think this through out loud, and for reminding me to kill flies with flyswatters instead of bazookas.

I'm working on a problem that I believe involves Bayes, I'm new to Bayes and a bit rusty on statistics, and I'm having a hard time figuring out where to start. (EDIT: it looks like set theory may also be involved.) Your help would be greatly appreciated.

Here's the problem: assume a set of 7 different objects. Two of these objects are presented at random to a participant, who selects whichever one of the two objects they prefer. (There is no "indifferent" option.) The order of these combinations is not important, and repeated combinations are not allowed.

Basic combination theory says there are 21 different possible combinations: (7!) / ( (2!) * (7-2)! ) = 21.
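The count is easy to check directly (a trivial sketch; the object labels A-G are placeholders for whatever the study actually compares):

```python
import math
from itertools import combinations

objects = list("ABCDEFG")
# All unordered pairs, no repeats: order within a pair is irrelevant.
pairs = list(combinations(objects, 2))

print(len(pairs))           # 21
print(math.comb(7, 2))      # 21, i.e. 7! / (2! * 5!)
```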

Now, assume the researcher wants to know which single option has the highest probability of being the "most preferred" to a new participant based on the responses of all previous participants. To complicate matters, each participant can leave at any time, without completing the entire set of 21 responses. Their responses should still factor into the final result, even if they only respond to a single combination.

At the beginning of the study, there are no priors. (CORRECTION via dlthomas: "There are necessarily priors... we start with no information about rankings... and so assume a 1:1 chance of either object being preferred.") If a participant selects B from {A,B}, the probability of B being the "most preferred" object should go up, and A's should go down, if I'm understanding correctly.

NOTE: Direct ranking of objects 1-7 (instead of pairwise comparison) isn't ideal because it takes longer, which may encourage the participant to rationalize. The "pick-one-of-two" approach is designed to be fast, which is better for gut reactions when comparing simple objects like words, photos, etc.

The ideal output looks like this: "Based on ___ total responses, participants prefer Object A. Object A is preferred __% more than Object B (the second most preferred), and ___% more than Object C (the third most preferred)."
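Before reaching for full Bayes or a voting method, a Laplace-smoothed win rate gives a crude version of this output. The responses below are invented for illustration, and the 1-win/2-appearance smoothing is just one way to encode the 1:1 "no information" prior from the correction above; a proper treatment would use something like Bradley-Terry or the Ranked Pairs method mentioned in the update:

```python
from collections import defaultdict

# Each response: (winner, loser) from one pairwise presentation.
# Participants may answer any subset of the 21 pairs and leave early.
responses = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "D"),
             ("E", "A"), ("A", "F"), ("A", "G"), ("B", "D")]

wins = defaultdict(int)
appearances = defaultdict(int)
for winner, loser in responses:
    wins[winner] += 1
    appearances[winner] += 1
    appearances[loser] += 1

objects = "ABCDEFG"
# Laplace smoothing: an object never shown starts at 50%, matching
# the 1:1 prior; more data moves the estimate away from 50%.
score = {o: (wins[o] + 1) / (appearances[o] + 2) for o in objects}
ranking = sorted(objects, key=score.get, reverse=True)

print(ranking[0], score[ranking[0]])   # A 0.75
```

The percentage comparisons in the ideal output then fall out of the score ratios, though note this simple estimator ignores who each object was compared against, which is exactly what the heavier-weight methods fix.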

Questions:

1. Is Bayes actually the most straightforward way of calculating the "most preferred"? (If not, what is? I don't want to be Maslow's "man with a hammer" here.)

2. If so, can you please walk me through the beginning of how this calculation is done, assuming 10 participants?