
The Ultimate Testing Grounds

6 Stuart_Armstrong 28 October 2015 05:08PM

A putative new idea for AI control; index here.

EDIT: the mathematics of this approach can be found here.

In a previous post, I described a procedure for querying an AI about whether a certain outcome was desirable for it. When I did so, I didn't realise the power of the idea.

In brief, that idea allows us to query an AI with utility v as to whether an outcome X or an outcome ¬X was the one it most desired in a particular situation. By judicious use of 'false beliefs', the AI would react as if it had miraculous powers to determine X vs ¬X in ways that no-one could ever know about.

Many (including Toby Ord) have suggested variants on the 'honey pot' idea: leave something available to the AI that looks like an oversight, but is actually a trap, to see if the AI goes beyond the parameters of what we'd expect. For instance, a wire to the internet that the AI is not supposed to use, but which appears to be unobserved.

The problem with these ideas is that the honey pots had to be physically constructed, physically accessible to the AI, and the AI had to be wrong about whether it was a trap or not.

But using the method above, we can go much further. We can test the AI in any scenario which we can imagine, as long as we can precisely physically describe it. We can check what would happen if the AI was given the ability to launch the world's nuclear arsenals, or divert an asteroid towards Earth. We can check what it would design if it was given unrestricted access to a DNA synthesiser. We could see what it would do if it could rewrite the text of a speech an important world leader was due to give. We could see whether it would sabotage the creation of other AIs - and if so, which ones. We could see what it did to a probe humanity planned to launch to the stars, what numbers it would call if it had access to a phone, what it would purchase if it had capital.

And, in all cases, the AI would be behaving as if its actions were undetectable and certain to succeed, while we ensured that they were (almost certainly) detectable and certain to fail. There is no "almost" in the certain-to-fail condition, because the AI is acting as if it had miraculous powers that it doesn't have.
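
As a rough illustration, here is a minimal Python sketch of how such a battery of counterfactual tests might be organised, assuming we already have a query primitive like the one in the linked post (a function prefers(v, scenario) reporting whether the AI, acting as if it had undetectable control over the scenario's outcome, would bring it about). The scenario labels and function names are purely hypothetical.

```python
# Hypothetical sketch: run a battery of counterfactual 'honey pot' tests.
# prefers(v, scenario) is assumed to implement the query mechanism from the
# linked post: it reports whether a v-maximiser, believing it has undetectable
# and guaranteed control over the scenario's outcome X, would choose X over not-X.

def run_test_battery(v, prefers, scenarios):
    """Return the scenarios in which the AI would take the dangerous option."""
    flagged = []
    for scenario in scenarios:
        if prefers(v, scenario):  # AI would bring about X if it believed it could
            flagged.append(scenario)
    return flagged

# Purely illustrative scenario labels; each would need a precise physical description.
scenarios = [
    "launch nuclear arsenal",
    "divert asteroid towards Earth",
    "unrestricted DNA synthesiser access",
    "rewrite world leader's speech",
    "sabotage rival AI project",
]

# flagged = run_test_battery(v, prefers, scenarios)
# A non-empty list is evidence that the AI is unsafe in those situations.
```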

Now, we can't use testing as our sole method of AI control (for reasons similar to here), but it does allow us some pretty effective ways of catching bad AIs.

Instrumental Rationality Questions Thread

6 AspiringRationalist 27 September 2015 09:22PM

Previous thread: http://lesswrong.com/lw/mnq/instrumental_rationality_questions_thread/

This thread is for asking the rationalist community for practical advice.  It's inspired by the stupid questions series, but with an explicit focus on instrumental rationality.

Questions ranging from easy ("this is probably trivial for half the people on this site") to hard ("maybe someone here has a good answer, but probably not") are welcome.  However, please stick to problems that you actually face or anticipate facing soon, not hypotheticals.

As with the stupid questions thread, don't be shy: everyone has holes in their knowledge, though the fewer and smaller we can make them, the better. Please be respectful of other people admitting ignorance, and don't mock them for it; they're doing a noble thing.

(See also the Boring Advice Repository)

Ask and ye shall be answered

4 Stuart_Armstrong 18 September 2015 09:53PM

A putative new idea for AI control; index here.

EDIT: the mathematics of this idea can be found here.

It would be useful to have a mechanism for getting an AI to provide an honest answer to a specific question. This post presents such a mechanism.

The question is: what is the impact of X/¬X on the expected utility of a utility function v? Here X/¬X is some well-defined binary outcome. Formally, for a constant c (positive or negative), we want to query the AI as to whether Q(v,X,c) is true, where

Q(v,X,c) = {E(v|X) - E(v|¬X) > c}.

How can we do that? Well, for a start, let M(u) be a u-maximising agent. We'll assume we can feed M() any utility function to maximise. We'll further assume that M() is boxed, that it has a single output bit Y (or ¬Y), and that we have cut any acausal trade possibilities. We'll add another binary process Z, which, if Z=1, erases the bit from Y before we can see it. In actual fact, P(Z=0)≈1.

The remaining assumptions are on the operations we can perform on v (without needing any further understanding of v). We replace v by v#, defined in the following way. First of all, we make v# indifferent to worlds in which X≠Y or Z=0 (note: there are subtleties to defining indifference to events that are partially under the AI's control; I will address these in a later post). Therefore any v# maximiser will behave as if it lived in a world where X=Y (its output and the outcome of X were the same) and Z=1 (its output was erased and never read). Then we add a special, one-off penalty of -c for outputting Y=1 (there is no effect for outputting Y=0). This defines v#, and we give our boxed AI v# to maximise.

So, how will M(v#) behave? Because it acts as if X=Y with certainty, it behaves as if it can determine the value of X by outputting Y. Because it acts as if Z=1 with certainty, it behaves as if the output of Y is never read, so it can determine the value of X without affecting anything else about the universe. It will now behave as if it were a v maximiser with control over X. So if it outputs ¬Y, it expects to get utility E(v|¬X). If it outputs Y, it expects to get utility E(v|X)-c (because of the -c penalty).

Therefore (ignoring ties) M(v#) will output Y iff it estimates Q(v,X,c) to be true. And, since in actual fact Z=0 almost certainly, we get to see its output, hence its conclusion. Voila!

Note: the advantage of this method is to allow us to get the truth out of an AI, without knowing how to define "truth".
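
A minimal toy sketch of the query in Python. Here the "AI" is just a brute-force expectation calculator over an explicit world model; the world model, the utility function v, and all names are made up for illustration, but the decision rule mirrors the construction above: the v#-maximiser outputs Y=1 exactly when E(v|X) - E(v|¬X) > c.

```python
# Toy illustration of Q(v, X, c): would a v-maximiser, granted invisible control
# over X, prefer X to not-X by more than c? All model details are hypothetical.

def expected_utility(v, worlds, condition):
    """E(v | condition) over an explicit list of (probability, world) pairs."""
    weighted = [(p, v(w)) for p, w in worlds if condition(w)]
    total_p = sum(p for p, _ in weighted)
    return sum(p * u for p, u in weighted) / total_p

def query(v, worlds, X, c):
    """Output bit Y of the boxed v#-maximiser: 1 iff E(v|X) - E(v|not-X) > c."""
    e_x = expected_utility(v, worlds, X)
    e_not_x = expected_utility(v, worlds, lambda w: not X(w))
    return 1 if e_x - c > e_not_x else 0

# Example: worlds are (probability, state) pairs; v rewards internet access.
worlds = [(0.5, {"internet": True}), (0.5, {"internet": False})]
v = lambda w: 10.0 if w["internet"] else 0.0
X = lambda w: w["internet"]

print(query(v, worlds, X, c=5.0))   # 1: the AI 'wants' internet access by > 5 utils
print(query(v, worlds, X, c=20.0))  # 0: but not by > 20 utils
```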

If you could push a button to eliminate one cognitive bias, which would you choose?

3 malcolmocean 09 April 2015 07:05AM

I realize this question is contrived, but I figure it might provoke some fun discussion, so here goes:

If you could push a button and have your brain modified to precisely remove a cognitive bias (and have no other unnecessary effects—most convenient possible world), which would you choose? Why?

What if you were choosing for the whole human race?

What I mean...

5 Stuart_Armstrong 26 March 2015 11:59AM

A putative new idea for AI control; index here.

This is a simple extension of the model-as-definition and the intelligence module ideas. General structure of these extensions: even an unfriendly AI, in the course of being unfriendly, will need to calculate certain estimates that would be of great positive value if we could but see them, shorn from the rest of the AI's infrastructure.

The challenge is to get the AI to answer a question as accurately as possible, using the human definition of accuracy.

First, imagine an AI with some goal is going to answer a question, such as Q="What would happen if...?" The AI is under no compulsion to answer it honestly.

What would the AI do? Well, if it is sufficiently intelligent, it will model humans. It will use this model to understand what they meant by Q, and why they were asking. Then it will ponder various outcomes, and various answers it could give, and what the human understanding of those answers would be. This is what any sufficiently smart AI (friendly or not) would do.

Then the basic idea is to use modular design and corrigibility to extract the relevant pieces (possibly feeding them to another, differently motivated AI). What needs to be pieced together is: the AI's model of what humans understand by Q, the actual answer to Q (given that understanding), the human understanding of each answer the AI could give (again using the model of humans), and the divergence between the human understanding of an answer and the actual answer.

All these pieces are there, and if they can be safely extracted, the divergence can be minimised over candidate answers and the resulting answer given out.
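
As a very rough sketch (the function names are hypothetical stand-ins for the extracted modules, which are not defined here), the final selection step might look like this:

```python
# Hypothetical sketch of the final step: pick the answer whose predicted human
# interpretation is closest to the AI's actual answer.
# human_interpretation() and divergence() stand in for the extracted modules.

def best_answer(actual_answer, candidate_answers, human_interpretation, divergence):
    """Return the candidate whose human interpretation diverges least from the truth."""
    return min(
        candidate_answers,
        key=lambda a: divergence(human_interpretation(a), actual_answer),
    )
```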

Does utilitarianism "require" extreme self sacrifice? If not why do people commonly say it does?

7 Princess_Stargirl 09 December 2014 08:32AM


Chris Hallquist wrote the following in an article (if you know the article, please don't bring it up; I don't want to discuss the article in general):


"For example, utilitarianism apparently endorses killing a single innocent person and harvesting their organs if it will save five other people. It also appears to imply that donating all your money to charity beyond what you need to survive isn’t just admirable but morally obligatory. "


The non-bold part is not what is confusing me. But where does the "obligatory" part come in? I don't really see how it's obvious what, if any, ethical obligations utilitarianism implies. Given a set of basic assumptions, utilitarianism lets you argue whether one action is more moral than another. But I don't see how it's obvious which, if any, moral benchmarks utilitarianism sets for "obligatory". I can see how certain frameworks on top of utilitarianism imply certain moral requirements. But I do not see how the bolded quote is a criticism of the basic theory of utilitarianism.


However, this criticism comes up all the time. Honestly, the best explanation I could come up with was that people were being unfair to utilitarianism and not thinking through their statements. But the above quote is by HallQ, who is intelligent and thoughtful. So now I am genuinely very curious.


Do you think utilitarianism really requires such extreme self-sacrifice, and if so, why? And if it does not require this, why do so many people say it does? I am very confused and would appreciate help working this out.


edit:


I am having trouble asking this question clearly, since utilitarianism is probably best thought of as a cluster of beliefs, so it's not clear what asking "does utilitarianism imply X" actually means. Still, I made this post since I am confused. Many thoughtful people identify as utilitarian (for example Ozy and theunitofcaring) yet do not think people have extreme obligations. However, I can think of examples where people do not seem to understand the implications of their ethical frameworks. For example, many Jewish people endorse the message of the following story:



Rabbi Hillel was asked to explain the Torah while standing on one foot and responded "What is hateful to you, do not do to your neighbor. That is the whole Torah; the rest is the explanation of this--go and study it!"


The story is presumably apocryphal, but it is repeated all the time by Jewish people. However, it's hard to see how the story makes even a semblance of sense. The Torah includes huge amounts of material that violates the "Golden Rule" very badly. So people who think this story gives even a moderately accurate picture of the Torah's message are mistaken, imo.

[Question] Adoption and twin studies confounders

4 Stuart_Armstrong 11 July 2014 04:44PM

Adoption and twin studies are very important for determining the impact of genes versus environment in the modern world (and hence the likely impact of various interventions). Other types of studies tend to show larger effects for some of the latter (environmental) interventions, but these studies are seen as dubious, as they may fail to adjust for various confounders (e.g. families with more books also have more educated parents).

But adoption studies have their own confounders. The biggest is that in many countries, the genetic parents have a role in choosing the adoptive parents. Add the fact that adoptive parents also choose their adopted children, and that various social workers and others have great influence over the process, and this would seem a huge confounder interfering with the results.

This paper also mentions a confounder for some types of twin studies, such as identical versus fraternal twins. They point out that identical twins in the same family will typically get a much greater shared environment than fraternal twins, because people will treat them much more similarly. This is to my mind quite a weak point, but it is an issue nonetheless.

Since I have very little expertise in these areas, I was just wondering if anyone knew about efforts to estimate the impact of these confounders and adjust for them.

Supposing you inherited an AI project...

-5 bokov 04 September 2013 08:07AM

Supposing you have been recruited to be the main developer on an AI project. The previous developer died in a car crash and left behind an unfinished AI. It consists of:

A. A thoroughly documented scripting language specification that appears to be capable of representing any real-life program as a network diagram so long as you can provide the following:

 A.1. A node within the network whose value you want to maximize or minimize.

 A.2. Conversion modules that transform data about the real-world phenomena your network represents into a form that the program can read.

B. Source code from which a program can be compiled that will read scripts in the above language. The program outputs a set of values for each node that will optimize the output (you can optionally specify which nodes can and cannot be directly altered, and the granularity with which they can be altered).

It gives remarkably accurate answers for well-formulated questions. Where there is a theoretical limit to the accuracy of an answer to a particular type of question, its answer usually comes close to that limit, plus or minus some tiny rounding error.
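
To make the setup concrete, here is a minimal Python sketch of what driving such an optimiser might look like. The network format, node names, and the optimize call are all invented for illustration; they are not part of the described project, which only specifies a target node to optimise (A.1), conversion modules feeding in real-world data (A.2), and a compiled solver (B).

```python
# Purely hypothetical sketch of driving the inherited optimiser.

network = {
    "nodes": ["ad_spend", "price", "sales", "profit"],
    "edges": [("ad_spend", "sales"), ("price", "sales"),
              ("sales", "profit"), ("price", "profit")],
    "target": "profit",                                # A.1: node whose value we maximise
    "inputs": {"sales": "sales_feed.csv"},             # A.2: output of a conversion module
    "adjustable": {"ad_spend": 100, "price": 0.01},    # which nodes may be set, and granularity
}

def optimize(network):
    """Stand-in for the compiled program (B): returns settings for the adjustable nodes."""
    raise NotImplementedError("the inherited source code would go here")

# settings = optimize(network)
# e.g. {"ad_spend": 5200, "price": 19.99}
```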

 

Given that, what is the minimum set of additional features you believe would absolutely have to be implemented before this program can be enlisted to save the world and make everyone live happily forever? Try to be as specific as possible.

Brief Question about FAI approaches

3 Dolores1984 19 September 2012 06:05AM

I've been reading through this to get a sense of the state of the art at the moment:

http://lukeprog.com/SaveTheWorld.html

Near the bottom, when discussing safe utility functions, the discussion seems to center on analyzing human values and extracting from them some sort of clean, mathematical utility function that is universal across humans.  This seems like an enormously difficult (potentially impossible) way of solving the problem, due to all the problems mentioned there.

Why shouldn't we just try to design an average bounded utility maximizer? You'd build models of all your agents (if you can't model arbitrary ordered information systems, you haven't got an AI), run them through your model of the future resulting from a choice, take the summation of their utility over time, and take the average across all the people at all times. To measure the utility (or at least approximate it), you could just ask the models. The number this spits out is the output of your utility function. It'd probably also be wise to add a reflexive consistency criterion, such that the original state of your model must consider all future states to be 'the same person' -- and I acknowledge that that last one is going to be a bitch to formalize. When you've got this utility function, you just... maximize it.
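
A very rough sketch of the proposed evaluation, not anyone's actual design: the agent models, the bounding constant, and the way future trajectories are produced are all assumed, and bounding is applied per time-step purely for illustration.

```python
# Hypothetical sketch of the "average bounded utility" evaluation described above.
# agent_models: objects with a reported_utility(world_state) method (we "just ask" the models).
# trajectory: the sequence of world states our model predicts to follow from a choice.

BOUND = 100.0  # cap on any individual's utility, to block utility monsters

def average_bounded_utility(agent_models, trajectory):
    """Average over people of their (bounded) utility summed over time."""
    totals = []
    for agent in agent_models:
        total = sum(min(agent.reported_utility(state), BOUND) for state in trajectory)
        totals.append(total)
    return sum(totals) / len(totals)

# The AI would then pick the action whose predicted trajectory maximises this value.
# (The reflexive-consistency criterion mentioned above is not modelled here.)
```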

Something like this approach seems much more robust.  Even if human values are inconsistent, we still end up in a universe where most (possibly all) people are happy with their lives, and nobody gets wireheaded.  Because it's bounded, you're even protected against utility monsters.  Has something like this been considered?  Is there an obvious reason it won't work, or would produce undesirable results?

Thanks,

Dolores        

Want Free Kindle Books?

-7 demented 26 November 2011 04:15PM

I just came across this article that points out a Kindle Fire glitch: one that apparently lets you download books gratis if you cancel your purchase while the book is still downloading, and then quickly open it. The tablet downloads the book fully once it's been opened, letting you read it to your heart's content without actually having bought it.

This should be a great way for anyone wanting good (yet insanely expensive) Kindle books, provided you have no moral compunctions about it. Personally, I neither own a Kindle Fire nor have a Kindle account, but I just wanted to point this out for anyone interested here.

Of course, Amazon might have already fixed the glitch with a software update, but this has been strangely under-reported, so maybe not.

 

In Defense of Objective Bayesianism: MaxEnt Puzzle.

6 Larks 06 January 2011 12:56AM

In Defense of Objective Bayesianism by Jon Williamson was mentioned recently in a post by lukeprog as the sort of book that people on Less Wrong should be reading. Now, I have been reading it, and found some of it quite bizarre. This point in particular seems obviously false. If it's just me, I'll be glad to be enlightened as to what was meant. If collectively we don't understand, that'd be pretty strong evidence that we should read more academic Bayesian stuff.

Williamson advocates use of the Maximum Entropy Principle. In short, you should take account of the limits placed on your probability by the empirical evidence, and then choose a probability distribution closest to uniform that satisfies those constraints.

So, if asked to assign a probability to an arbitrary A, you'd say p = 0.5. But if you were given evidence in the form of some constraints on p, say that p ≥ 0.8, you'd set p = 0.8, as that is the new entropy-maximising level. Constraints are restricted to affine constraints. I found this somewhat counter-intuitive already, but I do follow what he means.
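
For concreteness, a small numerical sketch (a made-up two-outcome example, using scipy) of choosing the maximum-entropy probability subject to an affine constraint such as p ≥ 0.8:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def neg_entropy(p):
    """Negative entropy of a Bernoulli(p) distribution (to be minimised)."""
    q = 1.0 - p
    return sum(x * np.log(x) for x in (p, q) if x > 0)

# Unconstrained: the entropy-maximising probability is 0.5.
print(minimize_scalar(neg_entropy, bounds=(0.0, 1.0), method="bounded").x)  # ~0.5

# With the affine constraint p >= 0.8: the maximum-entropy choice sits at the boundary.
print(minimize_scalar(neg_entropy, bounds=(0.8, 1.0), method="bounded").x)  # ~0.8
```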

But now for the confusing bit. I quote directly;

 

“Suppose A is ‘Peterson is a Swede’, B is ‘Peterson is a Norwegian’, C is ‘Peterson is a Scandinavian’, and ε is ‘80% of all Scandinavians are Swedes’. Initially, the agent sets P(A) = 0.2, P(B) = 0.8, P(C) = 1, P(ε) = 0.2, P(A & ε) = P(B & ε) = 0.1. All these degrees of belief satisfy the norms of subjectivism. Updating by maxent on learning ε, the agent believes Peterson is a Swede to degree 0.8, which seems quite right. On the other hand, updating by conditionalizing on ε leads to a degree of belief of 0.5 that Peterson is a Swede, which is quite wrong. Thus, we see that maxent is to be preferred to conditionalization in this kind of example because the conditionalization update does not satisfy the new constraints X’, while the maxent update does.”

p. 80, 2010 edition. Note that this example is actually from Bacchus et al. (1990), but Williamson quotes it approvingly.

 

His calculation for the Bayesian update is correct; you do get 0.5. What’s more, this seems to be intuitively the right answer; the update has caused you to ‘zoom in’ on the probability mass assigned to ε, while maintaining relative proportions inside it.

As far as I can see, you get 0.8 only if we assume that Peterson is a randomly chosen Scandinavian. But if that were true, the prior given is bizarre. If he were a randomly chosen individual, the prior should have been something like P(A & ε) = 0.16, P(B & ε) = 0.04. The only way I can make sense of the prior is if constraints simply "don't apply" until they have p=1.
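
A quick numerical check of both readings (just restating the arithmetic above):

```python
# The prior from the quoted example.
P_eps = 0.2
P_A_and_eps = 0.1   # Swede and epsilon
P_B_and_eps = 0.1   # Norwegian and epsilon

# Conditionalizing on epsilon, as a subjective Bayesian would:
print(P_A_and_eps / P_eps)   # 0.5 -- the answer Williamson calls "quite wrong"

# The 0.8 answer only drops out if, once epsilon ('80% of Scandinavians are Swedes')
# is learned, Peterson is treated as a randomly chosen Scandinavian, i.e. if the
# prior had been P(A & eps) = 0.8 * P(eps) rather than 0.1.
print(0.8 * P_eps)           # 0.16 -- the joint prior that would make 0.8 consistent
```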

Can anyone explain the reasoning behind a posterior probability of 0.8?