
Comment author: entirelyuseless 22 November 2017 03:24:20PM 1 point [-]

I think we should use "agent" to mean "something that determines what it does by expecting that it will do that thing," rather than "something that aims at a goal." This explains why we don't have exact goals, but also why we "kind of" have goals: our actions look goal-directed, so "I am seeking this goal" is a good way to predict what we are going to do, that is, a good way to determine what to expect ourselves to do, and that expectation in turn makes us do it.

Comment author: Stuart_Armstrong 27 November 2017 11:44:18AM 0 points [-]

Seems a reasonable way of seeing things, but not sure it works if we take that definition too formally/literally.

Comment author: JenniferRM 23 November 2017 07:51:49PM *  0 points [-]

Initially I wrote a response spelling out in excruciating detail an example of a decent chess bot playing the final moves in a game of Preference Chess, ending with "How does this not reveal an extremely clear example of trivial preference inference, what am I missing?"

Then I developed the theory that what I'm missing is that you're not talking about "how preference inference works" but more like "what are extremely minimalist preconditions for preference inference to get started".

And given where this conversation is happening, I'm guessing that one of the things you can't take for granted is that the agent is at all competent, because sort of the whole point here is to get this to work for a super intelligence looking at a relatively incompetent human.

So even if a Preference Chess Bot has a board situation where it is one move away from winning, losing, or taking another piece that it might prefer to take... no matter what move the bot actually performs, you could argue it was just a mistake, because it couldn't even understand the extremely short-run, tournament-level consequences of whatever Preference Chess move it made.

So I guess I would argue that even if any specific level of stable state intellectual competence or power can't be assumed, you might be able to get away with a weaker assumption of "online learning"?

It will always be tentative, but I think it buys you something similar to full rationality that is more likely to be usefully true of humans. Fundamentally you could use "an online learning assumption" to infer "regret of poorly chosen options" from repetitions of the same situation over and over, where either similar or different behaviors are observed later in time.

To make the agent have some of the right resonances... imagine a person at a table who is very short and wearing a diaper.

The person's stomach noisily grumbles (which doesn't count as evidence-of-preference at first).

They see in front of them a cupcake and a cricket (their seeing both is somewhat important, because it means they could know that a choice is even possible, allowing us to increment the choice-event counter here).

They put the cricket in their mouth (which doesn't count as evidence-of-preference at first).

They cry (which doesn't count as evidence-of-preference at first).

However, we repeat this process over and over and notice that by the 50th repetition they are reliably putting the cupcake in their mouth and smiling afterwards. So we use the relatively weak "online learning assumption" to say that something about the cupcake choice itself (or the cupcake's second-order consequences, which the person may think semi-reliably happen) is more preferred than the cricket.
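To make the "online learning assumption" concrete, here is a minimal sketch (my own framing, with invented names and thresholds) of inferring a preference only from a change in choice rates across repetitions of the same situation, never from a single action:

```python
# Hypothetical sketch of the "online learning assumption" described above:
# infer a preference only from a *change in choice rates* across repetitions
# of the same situation, not from any single action.

def infer_preference(choices, window=10, threshold=0.8):
    """choices: list of options picked on each repetition, e.g. 'cupcake'/'cricket'.
    Returns the option whose late choice rate crosses `threshold` after its
    early rate was well below it; otherwise None (no inference licensed)."""
    if len(choices) < 2 * window:
        return None  # too few repetitions to observe a rate change
    early, late = choices[:window], choices[-window:]
    for option in set(choices):
        early_rate = early.count(option) / window
        late_rate = late.count(option) / window
        if late_rate >= threshold and early_rate < threshold / 2:
            return option  # rate went from rare to dominant: evidence of preference
    return None

# By the 50th repetition the person reliably picks the cupcake:
history = ["cricket"] * 8 + ["cupcake"] * 2 + ["cupcake"] * 40
print(infer_preference(history))  # -> 'cupcake'
```

Note that a constant choice rate licenses no inference here; only the rare-to-common shift counts, which is what makes the assumption weaker than full rationality.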

Also, the earlier crying and later smiling begin to take on significance as side-channel signals of preference (or perhaps they are the actual thing being pursued as a second-order consequence?) because the cry/smile actions reliably come right after the action whose rate changes over time from rare to common.

The development of theories about side-channel information could make things go faster as time goes on. It might even become the dominant mode of inference, up to the point where it starts to become strategic, as with lying about one's goals in competitive negotiation contexts becoming salient once the watcher and actor are very deep into the process...

However, I think your concern is to find some way to make the first few foundational inferences in a clear and principled way that does not assume mutual understanding between the watcher and the actor, and does not assume perfect rationality on the part of the actor.

So an online learning assumption does seem to enable a tentative process, that focuses on tiny little recurring situations, and the understanding of each of these little situations as a place where preferences can operate causing changes in rates of performance.

If a deeply wise agent is the watcher, I could imagine them attempting to infer local choice tendencies in specific situations and envisioning how "all the apparently preferred microchoices" might eventually chain together into some macro scale behavioral pattern. The watcher might want to leap to a conclusion that the entire chain is preferred for some reason.

It isn't clear that the inference to the preference for the full chain of actions would be justified, precisely because of the assumption of the lack of full rationality.

The watcher would want to see the full chain start to occur in real life, and to become more common over time when chain initiation opportunities presented themselves.

Even then, the watcher might double-check by somehow adding signposts to the actor's environment, perhaps showing the actor pictures of the 2nd, 4th, 8th, and 16th local action/result pairs that it thinks are part of a behavioral chain. The worry is that the actor might not be aware how predictable they are, and might not actually prefer all that can be predicted from their pattern of behavior...

(Doing the signposting right would require a very sophisticated watcher/actor relationship, where the watcher had already worked out a way to communicate with the actor, and observed the actor learning that the watcher's signals often functioned as a kind of environmental oracle for how the future could go, with trust in the oracle and so on. These preconditions would all need to be built up over time before post-signpost action rate increases could be taken as a sign that the actor preferred performing the full chain that had been signposted. And still things could be messed up if "hostile oracles" were in the environment such that the actor's trust in the "real oracle" is justifiably tentative.)

One especially valuable kind of thing the watcher might do is to search the action space for situations where a cycle of behavior is possible, with a side effect each time through the loop, and to put this loop and the loop's side effect into the agent's local awareness, to see if maybe "that's the point" (like a loop that causes the accumulation of money, and after such signposting the agent does more of the thing) or maybe "that's a tragedy" (like a loop that causes the loss of money, which might be a Dutch booking in progress, and after signposting the agent does less of the thing).
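The loop search above can be sketched very simply. This is a hypothetical toy (all names invented): scan an action/state history for a repeated state, and report the side effect (here, money change) accrued per pass through the loop, so a money-losing cycle can be flagged as a possible Dutch book:

```python
# Hypothetical sketch of the watcher's loop search: find a repeated state
# (a cycle) in the actor's history and sum the side effect per pass.

def find_cycle_side_effect(states, money):
    """states[i]: hashable world state at step i; money[i]: balance at step i.
    Returns (cycle_start, cycle_end, money_delta_per_loop) or None."""
    seen = {}
    for i, s in enumerate(states):
        if s in seen:
            j = seen[s]
            return j, i, money[i] - money[j]  # delta accumulated per pass
        seen[s] = i
    return None

# A loop that loses money each pass looks like a Dutch booking in progress:
states = ["A", "B", "C", "A"]
money = [100, 95, 92, 90]
print(find_cycle_side_effect(states, money))  # -> (0, 3, -10)
```

A negative delta is what the watcher would signpost as "that's a tragedy"; a positive one as "maybe that's the point".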

Is this closer to what you're aiming for? :-)

Comment author: Stuart_Armstrong 27 November 2017 11:43:14AM 1 point [-]

I'm sorry, I have trouble following long posts like that. Would you mind presenting your main points in smaller, shorter posts? I think it would also make debate/conversation easier.

Comment author: JenniferRM 22 November 2017 01:03:32AM *  1 point [-]

Perhaps I'm missing something, but it seems like "agent H" has nothing to do with an actual human, and that the algorithm and environment as given support even less analogy to a human than a thermostat.

Thus, proofs about such a system are of almost no relevance to moral philosophy or agent alignment research?

Thermostats connected to heating and/or cooling systems are my first goto example for asking people where they intuitively experience the perception of agency or goal seeking behavior. I like using thermostats as the starting point because:

  1. Their operation has clear connections to negative feedback loops and thus obvious "goals" because they try to lower the temperature when it is too hot and try to raise the temperature when it is too cold.

  2. They have internally represented goals, because their internal mechanisms can be changed by exogenous-to-the-model factors that change their behavior in response to otherwise identical circumstances. Proximity plus non-overlapping ranges automatically leads to fights without any need for complex philosophy.

  3. They have a natural measure of "optimization strength" in the form of the wattage of their heating and cooling systems, which can be adequate or inadequate relative to changes in the ambient temperature.

  4. They require a working measurement component that detects ambient temperature, giving a very limited analogy for "perception and world modeling". If two thermostats are in a fight, a "weak and fast" thermostat can use a faster sampling rate to get a head start on the "slower but stronger" thermostat that put the temperature where it wanted and then rested for 20 minutes before measuring again. This would predictably give a cycle of temporary small victories for the fast one that turn into wrestling matches it always loses, over and over.
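The duel in point 4 is easy to simulate. Here is a hypothetical toy model (all setpoints, wattages, and sampling rates are made-up numbers) showing the fast-but-weak cooler winning small victories between the slow-but-strong heater's samples, and losing each wrestling match when the heater wakes up:

```python
# Toy model of the thermostat "fight": a weak cooler samples every minute,
# a strong heater samples every 20 minutes. Numbers are purely illustrative.

def duel(minutes=60):
    temp = 20.0
    log = []
    for t in range(minutes):
        if temp > 18:                  # fast, weak cooler wants 18
            temp -= 0.2                # small push every minute
        if t % 20 == 0 and temp < 25:  # slow, strong heater wants 25
            temp += 5.0                # big push, then "rests" 20 minutes
        log.append(round(temp, 1))
    return log

trace = duel()
# Between heater wake-ups the fast cooler wins small victories (temp drifts
# down); each heater sample is a wrestling match the cooler loses (temp jumps).
```

In this parameterization the heater's 5-degree push per cycle slightly outweighs the cooler's 4 degrees of accumulated nudges, so the "stronger" unit wins the long game, matching the intuition about optimization strength.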

I personally bite the bullet and grant that thermostats are (extremely minimal) agents with (extremely limited) internal experiences, but I find that most people I talk about this with do not feel comfortable admitting that these might be "any kind of agent".

Yet the thermostat clearly has more going on than "agent H" in your setup.

A lot of people I talk with about this are more comfortable with a basic chess bot architecture than a thermostat, when talking about the mechanics of agency, because:

  1. Chess bots consider more than simple binary actions.

  2. Chess bots generate iterated tree-like models of the world and perform the action that seems likely to produce the most preferred expected long term consequence.

  3. Chess bots prune possible futures such that they try not to do things that hostile players could exploit now or in the iterated future, demonstrating a limited but pragmatically meaningful theory of mind.

Personally, I'm pretty comfortable saying that chess bots are also agents, and they are simply a different kind of agent than a thermostat, and they aren't even strictly "better" than thermostats because thermostats have a leg up on them in having a usefully modifiable internal representation of their goals, which most chess bots lack!

An interesting puzzle might be how to keep much of the machinery of chess, but vary the agents during the course of their training and development so that they have skillful behavioral dynamics while different chess bots' skills are organized around different preference hierarchies: for example, preferring to checkmate the opponent while still having both bishops; lower down the hierarchy, preferring to be checkmated while retaining both bishops; and even further down, losing a bishop and also being checkmated.

Imagine a tournament of 100 chess bots where the rules of chess are identical for everyone, but some of the players are in some sense "competing in different games" due to a higher level goal of beating the chess bots that have the same preferences as them. So there might be bishop keepers, bishop hunters, queen keepers, queen hunters, etc.

Part of the tournament rules is that it would not be public knowledge who is in which group (though the parameters of knowledge could be an experimental parameter).

And in a tournament like that I'm pretty sure that any extremely competitive bishop-keeping chess bot would find it very valuable to be able to guess, from observation of the opponent's early moves, that in a specific game it might be playing a rook-hunting chess bot that would prefer to capture a rook and then be checkmated rather than officially "tie the game" without ever capturing one of its rooks.

In a tournament like this, keeping your true preferences secret and inferring your opponent's true preferences would both be somewhat useful.

Some overlap in the game should always exist (like preference for win > tie > lose all else equal) and competition on that dimension would always exist.

Then if AgentAlice knows AgentBob's true preferences, she can probably see deeper into the game tree than otherwise by safely pruning more lines of play out of the tree, and so have a better chance of winning. On the other hand, mutual revelation of preferences might allow gains from trade, so it isn't instantly clear when to reveal preferences and when to keep them cryptic...

Also, chess is probably more complicated than is conceptually necessary. Qubic (basically tic-tac-toe on a 4x4x4 grid) probably has enough steps and content to allow room for variations in strategy (liking to have played in corners, or whatever), so that the "preference" aspects could dominate the effort put into it rather than demanding extensive and subtle knowledge of chess.

Since qubic was solved at least as early as 1992, it should probably be easier to prove things about "qubic with preferences" using the old proofs as a starting point. Also it is probably a good idea to keep in mind which qubic preferences are instrumentally entailed by the pursuit of basic winning, so that preferences inside and outside those bounds get different logical treatment :-)

Comment author: Stuart_Armstrong 22 November 2017 01:33:43PM *  1 point [-]

Thanks! But H is used as an example, not a proof.

And the chessbots actually illustrate my point - is a bishop-retaining chessbot actually intending to retain their bishop, or is it an agent that wants to win, but has a bad programming job which inflates the value of bishops?

Comment author: JenniferRM 03 November 2017 06:19:10AM *  0 points [-]

I see how arguments that "the great filter is extremely strong" generally suggest that any violent resistance against an old race of exterminators is hopeless.

However it seems to me as if the silent sky suggests that everything is roughly equally hopeless. Maybe I'm missing something here, and if so I'd love to be corrected :-)

But starting from this generic evidential base, if everything is hopeless because of the brute fact of the (literally astronomically) large silent sky (with the strength of this evidence blocking nearly every avenue of hope for the future), I'm reasonably OK with allocating some thought to basically every explanation of the silent sky that has a short description length, which I think includes the pessimistic zoo hypothesis...

Thinking about this hypothesis might suggest methods to timelessly coordinate with other "weed species"? And this or other thoughts might suggest new angles on SETI? What might a signal look like from another timelessly coordinating weed species? This sort of thinking seems potentially productive to me...

HOWEVER, one strong vote against discussing the theory is that the pessimistic zoo hypothesis is an intrinsically "paranoid" hypothesis. The entities postulated include an entity of unknown strength that might be using its strength to hide itself... hence: paranoia.

Like all paranoid theories there is a sort of hope function where each non-discovery of easy/simple evidence for the existence of a hostile entity marginally increases both (1) the probability that the entity does not exist, and (2) the probability that if the entity exists it is even better at hiding from you than you had hypothesized when you searched in a simple place with the mild anticipation of seeing it.
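The "hope function" update can be made explicit. Below is a minimal Bayesian sketch (my framing, with made-up detection probabilities) over three hypotheses: the entity is absent, the entity hides at a level the search could penetrate, or the entity hides better than that. A null search result shifts probability mass toward both "absent" and "better hider", exactly the two-sided update described above:

```python
# Hypothetical sketch of the "hope function": a fruitless search updates
# both "no entity" and "entity hides better than the level searched"
# at the expense of "entity detectable at this level".

def update_on_non_detection(p_absent, p_weak_hider, p_strong_hider,
                            detect_if_weak=0.9, detect_if_strong=0.05):
    """Bayes update on the three hypotheses after a search finds nothing."""
    # Likelihood of a null result under each hypothesis
    l_absent, l_weak, l_strong = 1.0, 1 - detect_if_weak, 1 - detect_if_strong
    z = p_absent * l_absent + p_weak_hider * l_weak + p_strong_hider * l_strong
    return (p_absent * l_absent / z,
            p_weak_hider * l_weak / z,
            p_strong_hider * l_strong / z)

# Start uncertain; one thorough null search shifts mass toward "absent"
# and "strong hider", away from "weak hider":
post = update_on_non_detection(1/3, 1/3, 1/3)
```

Iterating this over a "totally comprehensive" sequence of searches is what leaves only the two endpoints standing: non-existence, or a "metaphysically strong" hider.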

At the end of a fruitless but totally comprehensive search of this sort you either believe that the entity does not physically exist, or else you think that it is sort of "metaphysically strong".

The recently popular "Three Body Problem" explores such paranoia a bit with regard to particle physics. Also, the powers seen in the monolith of Clarke's "2001" come to mind (although that entity seemed essentially benevolent, and weak compared to what might be seen in a fully bleak situation), and Clarke himself coined the claim that "any sufficiently advanced technology is indistinguishable from magic" partly, I think, to justify some of what he wrote as being respectable-enough-for-science-fiction.

This brings up a sort of elephant in the room: paranoid hypotheses are often a cognitive tarpit that captures the fancy of the mentally ill and/or theologically inclined people.

The hallmarks of bad thinking here tend to be (1) updating too swiftly in the direction of extreme power on the part of the hidden entity, (2) getting what seem like a lot of false positives when analyzing situations where the entity might have intervened, and (3) using the presumed interventions to confabulate motives.

To discuss a paranoid hypothesis in public risks the speaker becoming confused in the mind of the audience with other people who entertain paranoid hypotheses with less care.

It would make a lot of sense to me if respectable thinkers avoid discussing the subject for this reason.

If I were going to work on this in public, I think it would be useful to state up front that I'd refrain from speculating about precise motives for silencing weed species like we might be. Also, if I infer extremely strong aliens, I'm going to hold off on using their inferred strength to explain anything other than astronomy data, and even that only reluctantly.

Also, I'd start by hypothesizing aliens that are extremely weak and similar to conventionally imaginable human technology that might barely be up to the task of suppression, and thoroughly rule that level of power out before incrementing the hypothesized power by a small amount.

Comment author: Stuart_Armstrong 06 November 2017 08:36:47PM 1 point [-]

> However it seems to me as if the silent sky suggests that everything is roughly equally hopeless.

Unless we assume the filter is behind us.

> Also, I'd start by hypothesizing aliens that are extremely weak and similar to conventionally imaginable human technology

The mere fact that they can cross between the stars implies they can divert an asteroid to slam into the Earth. This gives an idea of what we'd need to do to defend against them, in theory.

Comment author: JenniferRM 28 October 2017 07:46:35AM *  1 point [-]

I'm tempted to suggest that the field of interstellar futurology has two big questions that both have very wide error bars which each, considered one at a time, suggest the need for some other theory (outside the horizon of common reasoning) to produce an answer.

It makes me wonder how plausible it is that these questions are related, and help answer each other:

(1) How many other species are out there for us to meet?

(2) Will we ever go out there or not?

For the first question, Occam suggests that we consider small numbers like "0" or "1", or else that we consider simple evolutionary processes that can occur everywhere and imply numbers like "many".

Observational evidence (as per Fermi) so far rules out "many".

Our own late-in-the-universe self-observing existence, with plausible plans for expansion into space (which makes the answer to the second question seem like it could be yes), suggests that 0 aliens out there is implausible... so what about just going with 1?

This 1 species would not be "AN alien race" but rather "THE alien race". They would be simply the one minimal other alien race whose existence is very strongly implied by minimal evidence plus logical reasoning.

Looping back to the second question of interstellar futurology (and following Occam and theoretical humility in trying to keep the number of theoretical elements small) perhaps the answer to whether our descendants will be visible in the skies of other species is "no with 99.99% probability" because of THE alien race.

When I hear "the zoo hypothesis" this logically simple version, without lots of details, is what I usually think of: Simply that there is "some single thing" and for some reason it makes the sky empty and forecloses our ever doing anything that would make the sky of another species NOT empty.

However, Wikipedia's zoo hypothesis is full of details about politics and culture, and about how moral progress is somehow going to make every species converge on the one clear moral rule of not being visible to any other species at our stage or below. So somehow we ourselves (and every single other species among the plausibly "many") are also in some sense "THE (culturally convergent) species": the space civilization that sprouts everywhere and inevitably evolves into maintaining the intergalactic zoo.

Yeah. This is all very nice... but it seems both very detailed and kind of hilariously optimistic... like back when the Soviet Union's working theory was that of course the aliens would be socialist... and then data came in and they refused to give up on the optimism even though it no longer made sense, so they just added more epicycles and kept chugging away.

I'm reminded of the novels of Alastair Reynolds where he calls THE alien race "The Inhibitors".

Reynolds gave them all kinds of decorative details that might be excused by the demand that commercial science fiction have dramatically compelling plots... However one of their details was that they were a galactic rather than intergalactic power. This seems like a really critical strategic fact that can't be written off as a detail added for story drama, and so that detail counts against the science side of his work. Too much detail of the wrong sort!

In the spirit of theoretical completeness, consider pairing the "optimistic zoo theory" with a more "pessimistic zoo theory".

In the pessimistic version, THE intergalactic alien race is going to come here and kill us. Our chance of preventing this extermination is basically "the number of stars we see that seem to have been the origin of a visible and friendly intergalactic civilization (plus one, as per Laplace's rule of succession) divided by the number of stars where a civilization with this potential could have developed".

By my count our chance of surviving using this formula would be ((0 + 1) / 10 ^ BIG).
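Written out, Laplace's rule of succession is (successes + 1) / (trials + 2); with zero observed survivors among a huge number of candidate origin stars, the "+2" is negligible, which recovers the (0 + 1) / 10^BIG figure above. A quick illustrative sketch (BIG here is a made-up placeholder exponent, not a claim about the true count):

```python
# Laplace's rule of succession: (successes + 1) / (trials + 2).
# With 0 observed survivors among ~10^BIG candidate stars, this is ~10^-BIG.

def laplace_rule(successes, trials):
    return (successes + 1) / (trials + 2)

BIG = 12  # purely illustrative exponent
p_survive = laplace_rule(0, 10**BIG)
# p_survive is on the order of 1e-12 under this made-up BIG:
# worse odds than a garden weed facing a human weeder.
```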

Like if it was you versus a weed in your garden, the weed's chances of surviving you are better than humanity's chances of surviving THE aliens.

Lower? Yes! At least individual weeds evolved under the selection pressure of animal grazing, and individual weeds have a plant genome full of survival wisdom to use to fight a human weeder, who is doing something more or less similar to what grazing animals do.

So the only strong observational argument I can see against the pessimistic zoo theory is that if it were true, then to square with what we see we have to suppose that THE alien weeders would bother with camouflage that SETI can't penetrate.

Consider all the potentially valuable things they could do with the universe that would tip us off right away, and then consider the cost of being visible. Would it be worth it for THE aliens (the first and only old intergalactic alien race) to hide in this way? I would not naively expect it.

Naively I'd have thought that the shape of the galaxy and its contents would be whatever they wanted it to be, and that attempts to model galactic orbital and/or stellar histories would point the finger at non-obvious causes, with signs of design intent relative to some plausible economic goal. Like this work, but with more attention to engineering intent.

So a good argument against this kind of pessimism seems like it would involve calculation of the costs and benefits of visible projects versus the benefits and costs of widespread consistent use of stealth technology.

If stealth is not worth it, then the Inhibitors (or Weeders or whatever you want to call THE aliens) wouldn't bother with hiding and the lack of evidence of their works would be genuine evidence that they don't exist.

> maybe you can do something that breaks the symmetry from the timeless decision theory perspective like send a massive signal to the galaxy...

The pessimistic zoo theory makes this proposal seem heroic to me :-)

The hard part here seems like it would be to figure out if there is anything humans can possibly build in the next few decades (or centuries?) that might continue to send a signal for the next 10 million years (in a way we could have detected in the 1970s) and that will continue to function despite THE alien race's later attempts to turn it off after they kill us because it messes up their stealth policy.

My guess is that the probability of an enduring "existence signal" being successfully constructed and then running for long enough to be detected by many other weed species is actually less than the probability that we might survive, because an enduring signal implies a kind of survival.

By contrast, limited "survival" might happen if samples of earth are taken just prior to a basically successful weeding event...

Greg Bear's "The Forge Of God" and sequel "Anvil of Stars" come to mind here. In those books Bear developed an idea that space warfare might be quite similar to submarine warfare, with silence and passive listening being the fundamental rule, most traps and weapons optimized for anonymous or pseudo-natural deployment, and traceable high energy physical attacks with visibly unnatural sources very much the exception.

As with all commercially viable books, you've got to have hope in there somewhere, so Bear populated the sky with >1 camouflaged god-like space civilizations that arrive here at almost precisely the same time, and one of them saves us in a way that sort of respects our agency but it leaves us making less noise than before. This seems optimistic in a way that Occam would justifiably complain about, even as it makes the story more fun for humans to read...

Comment author: Stuart_Armstrong 31 October 2017 04:36:37PM 1 point [-]

> pessimistic zoo theory

I've thought about things like that before, but always dismissed them, not as wrong but as irrelevant - there is nothing that can be done about that, as they would certainly have a fully armed listening post somewhere in the solar system to put us down when the time comes (though the fact they haven't yet is an argument against their existence).

But since there's nothing to be done, I ignore the hypothesis in practice.

Comment author: Stuart_Armstrong 25 October 2017 04:07:17PM 1 point [-]

I suggest you check with Nate what exactly he thinks, but my opinion is:

> If two decision algorithms are functionally equivalent, but algorithmically dissimilar, you'd want a decision theory that recognises this.

I think Nate agrees with this, and any lack of functional equivalence is due to not being able to fully specify that yet.

> f and f' are functionally correlated, but not functionally equivalent. FDT does not recognise this.

Can't this be modelled as uncertainty over functional equivalence? (or over input-output maps)?

Comment author: Khoth 25 October 2017 05:58:34AM 1 point [-]

I think there are two ways that a reward function can be applicable:

1) For making moral judgements about how you should treat your agent. Probably irrelevant for your button presser unless you're a panpsychist.

2) If the way your agent works is by predicting the consequences of its actions and attempting to pick an action that maximises some reward (e.g. a chess computer trying to maximise its board valuation function). Your agent H as described doesn't work this way, although, as you note, there are agents that do act this way and produce the same behaviour as your H.

There's also the kind-of option:

3) Anything can be modelled as if it had a utility function, in the same way that any solar system can be modelled as a geocentric one with enough epicycles. In this case there's no "true" reward function, just "the reward function that makes the maths I want to do as easy as possible". Which one that is depends on what you're trying to do, and maybe pretending there's a reward function isn't actually better than using H's true non-reward-based algorithm.

Comment author: Stuart_Armstrong 25 October 2017 03:51:04PM 0 points [-]

My "solution" does use 2), and should be posted in the next few days (maybe on lesswrong 2 only - are you on that?)

Comment author: Stuart_Armstrong 24 October 2017 03:18:58PM 0 points [-]

Some practical examples of what you mean could be useful.

Comment author: toonalfrink 13 October 2017 02:13:50PM 0 points [-]

I'd like to note that "caring about Us a bit" can also be read as "small probability of caring about Us a lot".

Comment author: Stuart_Armstrong 14 October 2017 06:17:04AM 0 points [-]

Actually, a small probability of caring about Us a bit, can suffice.

Comment author: turchin 13 October 2017 02:19:47PM 0 points [-]

Also, the question was not if I could judge other's values, but is it possible to prove that AI has the same values as a human being.

Or are you going to prove the equality of two value systems while at least one of them remains unknowable?

Comment author: Stuart_Armstrong 14 October 2017 06:12:14AM 1 point [-]

I'm more looking at "formalising human value-like things, into something acceptable".
