Here's a scenario that doesn't seem completely implausible. Suppose Bob is someone whose public key is easily available on the internet. The first AI will read things on the internet and output a message. Some of the message will get put on the public internet. Bob suspects that the AI might have secretly sent him a message (e.g. giving him advice on which stocks to buy). So he tries using his private key to decrypt some of the AI's output (e.g. the lowest-order bits in some images the AI has output).
Knowing that Bob (or someone else like Bob) will li...
The trouble is that it's much easier to create a steganographic message (e.g. encrypting a message using a particular public key) than to detect it (which requires knowing the private key or otherwise breaking the encryption). So in this case "much more computing power" has to mean "exponentially more computing power".
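To make the asymmetry concrete, here's a minimal sketch of the embedding side (my own toy example; the ciphertext is a placeholder standing in for a message already encrypted to Bob's public key, and the "image" is just a flat list of 8-bit pixel values):

def embed_lsb(pixels, payload):
    """Write each bit of payload into the least significant bit of a pixel."""
    bits = [(byte >> i) & 1 for byte in payload for i in range(8)]
    assert len(bits) <= len(pixels), "image too small for payload"
    stego = list(pixels)
    for idx, bit in enumerate(bits):
        stego[idx] = (stego[idx] & ~1) | bit  # clear the LSB, then set it to the payload bit
    return stego

def extract_lsb(pixels, num_bytes):
    """Read num_bytes back out of the least significant bits."""
    out = bytearray()
    for byte_index in range(num_bytes):
        byte = 0
        for i in range(8):
            byte |= (pixels[byte_index * 8 + i] & 1) << i
        out.append(byte)
    return bytes(out)

ciphertext = b"\x8f\x03\xa2\x47"    # stand-in for the output of some public-key encryption
image = [128] * 64                  # stand-in for real pixel data
stego_image = embed_lsb(image, ciphertext)
assert extract_lsb(stego_image, len(ciphertext)) == ciphertext

Embedding and extraction are both cheap; the expensive part falls entirely on a monitor without Bob's private key, for whom the low-order bits look like ordinary pixel noise.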
You might be interested in reading:
https://medium.com/ai-control/mimicry-maximization-and-meeting-halfway-c149dd23fc17#.v6e533hkf
https://medium.com/ai-control/elaborations-of-apprenticeship-learning-eb93a53ae3ca#.5ubczdqf0
https://intelligence.org/files/QuantilizersSaferAlternative.pdf
This prevents the first AI from doing evil things with its output. If it tries to insert complicated infohazards or subagents into its output stream, it will be easily detected as an AI. Instead it needs to mimic humans as closely as possible.
Note that steganography is s...
One of the most common objections I've seen is that we're too far from getting AGI to know what AGI will be like, so we can't productively work on the problem without making a lot of conjunctive assumptions -- e.g. see this post.
And before I could scribble a damned thing, Calude went and solved it six months ago. The Halting Problem, I mean.
Cool. If I understand the result correctly, it's that if you run a random program for some number of steps and it doesn't halt, then (depending on the exact numbers) it will be unlikely to halt when run on a supercomputer either, because halting times have low density. So almost all programs either halt quickly or run for a really, really long time. Is this correct? This doesn't quite let you approximate Chaitin's omega, but it's interesting tha...
It's not a specific programming language, I guess it's meant to look like Church. It could be written as:
(query
. (define a (p))
. (foreach (range n)
. . (lambda (i)
. . . (define x (x-prior))
. . . (factor (log (U x a))))))
Well so does the sigmoided version
It samples an action proportional to p(a) E[sigmoid(U) | a]. This can't be written as a function of E[U | a].
Due to the planning model, the successor always has some nonzero probability of not pressing the button, so (depending on how much you value pressing it later) it'll be worth it to press it at some point.
When you use e-raised-to-the alpha times expectation, is that similar to the use of an exponential distribution in something like Adaboost, to take something like odds information and form a distribution over assorted weights?
I'm not really that familiar with Adaboost. The planning model is just reflecting the fact that bounded agents don't always take the maximum expected utility action. The higher alpha is, the more bias there is towards good actions, but the more potentially expensive the computation is (e.g. if you use rejection sampling).
...Since
Your model selects an action proportional to p(a) E[sigmoid(U) | a], whereas mine selects an action proportional to p(a) e^E[U | a]. I think the second is better, because it actually treats actions the same if they have the same expected utility. The sigmoid version will not take very high utilities or very low utilities into account much.
Btw it's also possible to select an action proportional to E[U | a]^n:
query {
. a ~ p()
. for i = 1 to n
. . x_i ~ P(x)
. . factor(log U(x_i, a))
}
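To make the difference concrete, here's a toy numerical sketch (my own numbers, with deterministic utilities so that E[sigmoid(U) | a] = sigmoid(E[U | a]) and the rules can be compared directly; the action names are purely illustrative):

import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

# Two actions with equal prior p(a); utilities are deterministic here.
utilities = {"decent_action": 2.0, "great_action": 10.0}
n = 3  # exponent for the E[U | a]^n rule

for name, u in utilities.items():
    print(name,
          "sigmoid score:", round(sigmoid(u), 5),
          "e^E[U] score:", round(math.exp(u), 1),
          "E[U]^n score:", round(u ** n, 1))

# The sigmoid scores differ by only about 14%, so sampling proportional to
# them treats the two actions almost identically; the e^E[U | a] scores
# differ by a factor of e^8 (about 3000), and the E[U | a]^3 scores by 125.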
(BTW: here's a writeup of one of my ideas for writing planning queries that you might be interested in)
Often we want a model where the probability of taking action a is proportional to p(a) e^E[U(x, a)] (with the expectation taken over x), where p is the prior over actions, x consists of some latent variables, and U is the utility function. The straightforward way of doing this fails:
query {
. a ~ p()
. x ~ P(x)
. factor(U(x, a))
}
Note that I'm assuming factor takes a log probability as its argument. This fails due to "wishful thinking": it tends to prefer ris...
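Here's a toy numerical sketch of that failure mode (my own numbers, using exhaustive enumeration rather than an actual probabilistic-programming query): a "risky" action whose utility is 0 or 1 with equal probability beats a "safe" action with certain utility 0.6 under factor(U), even though its expected utility is lower, because the factor rewards E[e^U | a] rather than e^E[U | a].

import math

# Two actions, uniform prior p(a).
# "safe":  utility 0.6 with certainty.
# "risky": utility 0 or 1, each with probability 0.5 (expected utility 0.5).
outcomes = {
    "safe":  [(1.0, 0.6)],
    "risky": [(0.5, 0.0), (0.5, 1.0)],
}

for action, dist in outcomes.items():
    expected_u = sum(p * u for p, u in dist)
    # Weight induced by factor(U), i.e. marginalizing e^U over the latent x:
    wishful_weight = sum(p * math.exp(u) for p, u in dist)
    # Weight we actually wanted, e^(E[U | a]):
    intended_weight = math.exp(expected_u)
    print(action, "E[U] =", expected_u,
          "E[e^U] =", round(wishful_weight, 3),
          "e^E[U] =", round(intended_weight, 3))

# The risky action gets the larger E[e^U] weight (1.859 vs 1.822) despite
# the lower expected utility -- the query "wishfully" infers that the
# latent variables worked out well.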
ZOMFG, can you link to a write-up? This links up almost perfectly with a bit of research I've been wanting to do.
Well, a write-up doesn't exist because I haven't actually done the math yet :)
But the idea is about algorithms for doing nested queries. There's a planning framework where you take action a proportional to p(a) e^E[U | a]. If one of these actions is "defer to your successor", then the computation of (U | a) is actually another query that samples a different action b proportional to p(b) e^E[U | b]. In this case you can actually j...
Yes, something like that, although I don't usually think of it as an adversary. Mainly it's so I can ask questions like "how could a FAI model its operator so that it can infer the operator's values from their behavior?" without getting hung up on the exact representation of the model or how the model is found. We don't have any solution to this problem, even if we had access to a probabilistic program induction black box, so it would be silly to impose the additional restriction that we can't give the black box any induction problems that are...
I should be specific that the kinds of results we want to get are those where you could, in principle, use a very powerful computer instead of a hypercomputer. Roughly, the unbounded algorithm should be a limit of bounded algorithms. The kinds of allowed operations I am thinking about include:
In all these cases, you can get arbitrarily good a...
Thanks for the detailed response! I do think the framework can still work with my assumptions. The way I would model it would be something like:
This model seems quite a bit different from mine, which is that FAI research is about reducing FAI to an AGI problem, and solving AGI takes more work than doing this reduction.
More concretely, consider a proposal such as Paul's reflective automated philosophy method, which might be able to be implemented using episodic reinforcement learning. This proposal has problems, and it's not clear that it works -- but if it did, then it would have reduced FAI to a reinforcement learning problem. Presumably, any implementations of this proposal would benefit from ...
I agree that choosing an action randomly (with higher probability for good actions) is a good way to create a fuzzy satisficer. Do you have any insights into how to:
create queries for planning that don't suffer from "wishful thinking", with or without nested queries. Basically the problem is that if I want an action conditioned on receiving a high utility (e.g. we have a factor on the expected utility node U equal to e^(alpha * U) ), then we are likely to choose high-variance actions while inferring that the rest of the model works out such that these actions return high utilities
extend this to sequential planning without nested nested nested nested nested nested queries
I would tend to say that you should be training a conceptual map of the world before you install anything like action-taking capability or a goal system of any kind.
This seems like a sane thing to do. If this didn't work, it would probably be because either
lack of conceptual convergence and human understandability; this seems somewhat likely and is probably the most important unknown
our conceptual representations are only efficient for talking about things we care about because we care about these things; a "neutral" standard such as reso
Regularization is already a part of training any good classifier.
A technical point here: we don't learn a raw classifier, because that would just learn human judgments. In order to allow the system to disagree with a human, we need to use some metric other than "is simple and assigns high probability to human judgments".
...For something like FAI, I want a concept-learning algorithm that will look at the world in this naturalized, causal way (which is what normal modelling shoots for!), and that will model correctly at any level of abstraction
Okay, thanks a lot for the detailed response. I'll explain a bit about where I'm coming from with understanding the concept learning problem:
I think you have homed in exactly on the place where the disagreement is located. I am glad we got here so quickly (it usually takes a very long time, when it happens at all).
Yes, it is the fact that "weak constraint" systems have (supposedly) the property that they are making the greatest possible attempt to find a state of mutual consistency among the concepts, that leads to the very different conclusions that I come to, versus the conclusions that seem to inhere in logical approaches to AGI. There really is no underestimating the drastic di...
We can do something like list a bunch of examples, have humans label them, and then find the lowest Kolmogorov complexity concept that agrees with human judgments in, say, 90% of cases. I'm not sure if this is what you mean by "normatively correct", but it seems like a plausible concept that multiple concept learning algorithms might converge on. I'm still not convinced that we can do this for many value-laden concepts we care about and end up with something matching CEV, partially due to complexity of value. Still, it's probably worth systematically studying the extent to which this will give the right answers for non-value-laden concepts, and then see what can be done about value-laden concepts.
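As a toy illustration of that procedure (my own sketch; Kolmogorov complexity is uncomputable, so this substitutes a crude description-length proxy, namely enumerating conjunctions of feature tests shortest-first, and the feature names and simplest_agreeing_concept helper are purely illustrative):

from itertools import combinations, product

# Toy stand-in for "find the lowest-complexity concept agreeing with >= 90%
# of human judgments": examples are dicts of boolean features, and candidate
# concepts are conjunctions of feature tests, enumerated shortest-first.

def make_concept(literals):
    # literals: list of (feature_name, required_value) pairs
    return lambda ex: all(ex[f] == v for f, v in literals)

def simplest_agreeing_concept(examples, human_labels, feature_names, threshold=0.9):
    for size in range(len(feature_names) + 1):        # shortest descriptions first
        for feats in combinations(feature_names, size):
            for values in product([True, False], repeat=size):
                concept = make_concept(list(zip(feats, values)))
                agreement = sum(concept(ex) == lab
                                for ex, lab in zip(examples, human_labels)) / len(examples)
                if agreement >= threshold:
                    return feats, values
    return None

# Hypothetical usage: labels track "is_sentient", with one noisy label.
examples = [{"is_sentient": s, "is_cute": c} for s in (True, False) for c in (True, False)] * 3
labels = [ex["is_sentient"] for ex in examples]
labels[0] = not labels[0]                             # 1 of 12 labels flipped
print(simplest_agreeing_concept(examples, labels, ["is_sentient", "is_cute"]))
# Returns the single-feature concept, since it still agrees with 11/12 labels.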
Thanks for your response.
The AI can quickly assess the "forcefulness" of any candidate action plan by asking itself whether the plan will involve giving choices to people vs. forcing them to do something whether they like it or not. If a plan is of the latter sort, more care is needed, so it will canvass a sample of people to see if their reactions are positive or negative.
So, I think this touches on the difficult part. As humans, we have a good idea of what "giving choices to people" vs. "forcing them to do something" l...
Thanks for posting this; I appreciate reading different perspectives on AI value alignment, especially from AI researchers.
...But, truthfully, it would not require a ghost-in-the-machine to reexamine the situation if there was some kind of gross inconsistency with what the humans intended: there could be some other part of its programming (let’s call it the checking code) that kicked in if there was any hint of a mismatch between what the AI planned to do and what the original programmers were now saying they intended. There is nothing difficult or intrinsi
I am going to have to respond piecemeal to your thoughtful comments, so apologies in advance if I can only get to a couple of issues in this first response.
Your first remark, which starts
If there is some good way...
contains a multitude of implicit assumptions about how the AI is built, and how the checking code would do its job, and my objection to your conclusion is buried in an array of objections to all of those assumptions, unfortunately. Let me try to bring some of them out into the light:
1) When you say
...If there is some good way of explaining
So, you can compress a list of observations about which Turing machines halt by starting with a uniform prior over Chaitin's omega. This can lead to quite a lot of compression: the information of whether the first n Turing machines halt consists of n bits, but only requires log(n) bits of Chaitin's omega. If we saw whether more Turing machines halted, we would also uncover more bits of Chaitin's omega. Is this the kind of thing you are thinking of?
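For reference, my paraphrase of the standard argument behind that compression, for a prefix-free universal machine U:

\Omega = \sum_{p \,:\, U(p)\ \text{halts}} 2^{-|p|}, \qquad \omega_t = \sum_{p\ \text{halts within } t \text{ steps}} 2^{-|p|} \nearrow \Omega

Given the first m bits of Omega, i.e. a rational Omega_m with Omega_m <= Omega < Omega_m + 2^{-m}, dovetail all programs until omega_t >= Omega_m. Any program of length at most m that hasn't halted by then would contribute at least 2^{-m} and push the sum past Omega, so it never halts. The first n machines in a standard enumeration have descriptions of length about log2(n) bits, which is why roughly log(n) bits of Omega settle which of them halt.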
I guess there's another question of how any of this makes sense if the universe is computable. We can stil...
One part I'm not clear on is how the empirical knowledge works. The equivalent of "kilograms of mass" might be something like bits of Chaitin's omega. If you have n bits of Chaitin's omega, you can solve the halting problem for any Turing machine of length up to n. But, while you can get lower bounds on Chaitin's omega by running Turing machines and seeing which halt, you can't actually learn upper bounds on Chaitin's omega except by observing uncomputable processes (for example, a halting oracle confirming that some Turing machine doesn't hal...
There's one scenario described in this paper on which this decision theory gives in to blackmail:
...The Retro Blackmail problem. There is a wealthy intelligent system and an honest AI researcher with access to the agent’s original source code. The researcher may deploy a virus that will cause $150 million each in damages to both the AI system and the researcher, and which may only be deactivated if the agent pays the researcher $100 million. The researcher is risk-averse and only deploys the virus upon becoming confident that the agent will pay up. The age
It's possible to compute whether each machine halts using an inductive Turing machine like this:
initialize output tape to all zeros, representing the assertion that no Turing machine halts
for i = 1 to infinity
. for j = 1 to i
. . run Turing machine j for i steps
. . if it halts: set bit j in the output tape to 1
Is this what you meant? If so, I'm not sure what this has to do with observing loops.
When you say that every nonhalting Turing machine has some kind of loop, do you mean the kind of loop that many halting Turing machines also contain?
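As an aside, the dovetailing pattern in that pseudocode is easy to sketch in ordinary code if you let Python generators stand in for Turing machines (a toy of my own: a "machine" whose generator is exhausted counts as halting, and running it "for i steps" here means i further steps, since generators keep their state):

import itertools

def machine_family(j):
    # Toy enumeration: odd-indexed machines loop forever,
    # even-indexed machines halt after j steps.
    if j % 2 == 1:
        return itertools.count()        # never exhausted: "does not halt"
    return iter(range(j))               # exhausted after j steps: "halts"

def dovetail(num_machines, num_rounds):
    halted = [False] * num_machines     # output tape: initially "nothing halts"
    machines = [machine_family(j) for j in range(num_machines)]
    for i in range(1, num_rounds + 1):
        for j in range(min(i, num_machines)):
            if halted[j]:
                continue
            try:
                for _ in range(i):      # run machine j for i more steps
                    next(machines[j])
            except StopIteration:
                halted[j] = True        # flip bit j once machine j halts
    return halted

print(dovetail(num_machines=8, num_rounds=20))
# Every bit that will ever flip to True does so at some finite round;
# bits for non-halting machines stay False forever (the "limit" answer).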
Thanks for the response. I should note that we don't seem to disagree on the fact that a significant portion of AI safety research should be informed by practical considerations, including current algorithms. I'm currently getting a masters degree in AI while doing work for MIRI, and a substantial portion of my work at MIRI is informed by my experience with more practical systems (including machine learning and probabilistic programming). The disagreement is more that you think that unbounded solutions are almost entirely useless, while I think they are...
Learning how to create even a simple recommendation engine whose output is constrained by the values of its creators would be a large step forward and would help society today.
I think something showing how to do value learning on a small scale like this would be on topic. It might help to expose the advantages and disadvantages of algorithms like inverse reinforcement learning.
I also agree that, if there are more practical applications of AI safety ideas, this will increase interest and resources devoted to AI safety. I don't really see those applic...
I don't have a great understanding of the history of engineering, but I get the impression that working from the theory backwards can often be helpful. For example, Turing developed the basics of computer science before sufficiently general computers existed.
The first computer was designed by Babbage, who was mostly interested in practical applications (although admittedly it was never built). About 100 years later, Konrad Zuse developed the first working computer, also for practical purposes. I'm not sure if he was even aware of Turing's work.
Not that Tu...
I think a post saying something like "Deep learning architectures are/are not able to learn human values because of reasons X, Y, Z" would definitely be on topic. As an example of something like this, I wrote a post on the safety implications of statistical learning theory. However, an article about how deep learning algorithms are performing on standard machine learning tasks is not really on topic.
I share your sentiment that safety research is not totally separate from other AI research. But I think there is a lot to be done that does not re...
This is an interesting approach. The way I'm currently thinking of this is that you ask what agent a UDT would design, and then do what that agent does, and vary what type an agent is between the different designs. Is this correct?
Consider the anti-Newcomb problem with Omega's simulation involving equation (2)
So is this equation (2) with P replaced with something else?
However, the computing power allocated for evaluation the logical expectation value in (2) might be sufficient to suspect P's output might be an agent reasoning based on (2).
I don't understand this sentence.
It still seems like this is very much affected by the measure you assign to different game of life universes, and that the measure strongly depends on f.
Suppose we want to set f to control the agent's behavior, so that when it sees sensory data s, it takes silly action a(s), where a is a short function. To work this way, f will map game of life states in which the agent has seen s and should take action a(s) to binary strings that have greater measure, compared to game of life states in which the agent has seen s and should take some other action. I thin...
Thanks for the additional explanation.
It is of similar magnitude to differences between using different universal Turing machines in the definition of the Solomonoff ensemble. These differences become negligible for agents that work with large amounts of evidence.
Hmm, I'm not sure that this is something that you can easily get evidence for or against? The 2^K factor in ordinary Solomonoff induction is usually considered fine because it can only cause you to make at most K errors. But here it's applying to utilities, which you can't get evidence for ...
I think you're essentially correct about the problem of creating a utility function that works across all different logically possible universes being important. This is kind of like what was explored in the ontological crisis paper. Also, I agree that we want to do something like find a human's "native domain" and map it to the true reality in order to define utility functions over reality.
I think using something like Solomonoff induction to find multi-level explanations is a good idea, but I don't think your specific formula works. It looks ...
So, you can kill a person, create a new person, and raise them to be about equivalent to the original person (on average; this makes a bit more sense if we do it many times so the distribution of people, life outcomes, etc is similar). I guess your question is, why don't we do this (aside from the cost)? A few reasons come to mind:
I said that fetuses are replaceable, not that all people are replaceable. OP didn't argue that fetuses weren't replaceable, just that they won't get replaced in practice.
I don't think you did justice to the replaceability argument. If fetuses are replaceable, then the only benefit of banning abortion is that it increases the fertility rate. However, there are far better ways to increase the fertility rate than banning abortion. For example, one could pay people to have children (and maybe give them up for adoption). So your argument is kind of like saying that since we really need farm laborers, we should allow slavery.
I think a useful meaning of "incomparable" is "you should think a very long time before deciding between these". In situations like these, the right decision is not to immediately decide between them, but to think a lot about the decision and related issues. Sure, if someone has to make a split-second decision, they will probably choose whichever sounds better to them. But if given a long time, they might think about it a lot and still not be sure which is better.
This seems a bit similar to multiple utility functions in that if you h...
We updated on the fact that we exist. SSA does this a little too: specifically, the fact that you exist means that there is at least one observer. One way to look at it is that there is initially a constant number of souls that get used to fill in the observers of a universe. In this formulation, SIA is the result of the normal Bayesian update on the fact that soul-you woke up in a body.
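A toy version of that update (my own numbers; the universe names and soul count are purely illustrative):

# N souls; each universe instantiates some of them as observers.
# Conditioning on "soul-you woke up in a body" reproduces the SIA weighting.
N_SOULS = 100
prior = {"universe_A": 0.5, "universe_B": 0.5}
observers = {"universe_A": 1, "universe_B": 10}

likelihood = {u: observers[u] / N_SOULS for u in prior}   # P(you are instantiated | u)
unnormalized = {u: prior[u] * likelihood[u] for u in prior}
total = sum(unnormalized.values())
posterior = {u: w / total for u, w in unnormalized.items()}

print(posterior)   # {'universe_A': 0.0909..., 'universe_B': 0.909...}
# Posterior odds 1:10, proportional to observer counts -- i.e. SIA.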
Transcript:
Question: Are you as afraid of artificial intelligence as your Paypal colleague Elon Musk?
Thiel: I'm super pro-technology in all its forms. I do think that if AI happened, it would be a very strange thing. Generalized artificial intelligence. People always frame it as an economic question, it'll take people's jobs, it'll replace people's jobs, but I think it's much more of a political question. It would be like aliens landing on this planet, and the first question we ask wouldn't be what does this mean for the economy, it would be are they...
Context: Elon Musk thinks there's an issue in the 5-7 year timeframe (probably due to talking to Demis Hassabis at Deepmind, I would guess). By that standard I'm also less afraid of AI than Elon Musk, but as Rob Bensinger will shortly be fond of saying, this conflates AGI danger with AGI imminence (a very very common conflation).
I'm talking about the fact that humans can (and sometimes do) sort of optimize the universe. Like, you can reason about the way the universe is and decide to work on causing it to be in a certain state.
So people say they have general goals, but in reality they remain human beings with various tendencies, and continue to act according to those tendencies, and only support that general goal to the extent that it's consistent with those other behaviors.
This could very well be the case, but humans still sometimes sort of optimize the universe. Like, I'm ...
In that sense I think the orthogonality thesis will turn out to be false in practice, even if it is true in theory. It is simply too difficult to program a precise goal into an AI, because in order for that to work the goal has to be worked into every physical detail of the thing. It cannot just be a modular add-on.
I find this plausible but not too likely. There are a few things needed for a universe-optimizing AGI:
really good mathematical function optimization (which you might be able to use to get approximate Solomonoff induction)
a way to specify
I, too, think that AIs that don't optimize a function over the universe (but might optimize one over a domain) are more likely to be safe. This is quite related to the idea of tool AI, proposed by Holden and criticized by Eliezer.
The key here seems to be creating a way to evaluate and search for self-improvements in a way that won't cause optimization over universe states. In theory, evaluation of a self-improvement might be able to be restricted to a domain: does this modification help me play chess better according to a model of the situation in which ...
Am working on it - as a placeholder, for many problems, one can use Stuart Armstrong's proposed algorithm of finding the best strategy according to a non-anthropic viewpoint that adds the utilities of different copies of you, and then doing what that strategy says.
I think this essentially leads to SIA. Since you're adding utilities over different copies of you, it follows that you care more about universes in which there are more copies of you. So your copies should behave as if they anticipate the probability of being in a universe containing lots of...
I'm not sure what you mean by "vanilla anthropics". Both SSA and SIA are "simple object-level rules for assigning anthropic probabilities". Vanilla anthropics seems to be vague enough that it doesn't give an answer to the doomsday argument or the presumptuous philosopher problem.
On another note, if you assume that a nonzero percentage of the multiverse's computation power is spent simulating arbitrary universes with computation power in proportion to the probabilities of their laws of physics, then both SSA and SIA will end up giving you very similar predictions to Brian_Tomasik's proposal, although I think they might be slightly different.
Okay. We seem to be disputing definitions here. By your definition, it is totally possible to build a very good cross-domain optimizer without it being an agent (so it doesn't optimize a utility function over the universe). It seems like we mostly agree on matters of fact.
I agree with this but I prefer weighting things by computation power instead of physics cells (which may turn out to be somewhat equivalent). It's easy to justify this model by assuming that some percentage of the multiverse's computation power is spent simulating all universes in parallel. See Schmidhuber's paper on this.
Well, he's right that intentionally evil AI is highly unlikely to be created:
Malevolent AI would need all these capabilities, and then some. Both an intent to do something and an understanding of human goals, motivations, and behaviors would be keys to being evil towards humans.
which happens to be the exact reason why Friendly AI is difficult. He doesn't directly address things that don't care about humans, like paperclip maximizers, but some of his arguments can be applied to them.
...Expecting more computation to just magically get to intentional int
I don't think the connotations of "silly" are quite right here. You could still use this program to do quite a lot of useful inference and optimization across a variety of domains, without killing everyone. Sort of like how frequentist statistics can be very accurate in some cases despite being suboptimal by Bayesian standards. Bostrom mostly only talks about agent-like AIs, and while I think that this is mostly the right approach, he should have been more explicit about that. As I said before, we don't currently know how to build agent-like AGIs because we haven't solved the ontology mapping problem, but we do know how to build non-agentlike cross-domain optimizers given enough computation power.
1) Perhaps you give it one domain and a utility function within that domain, and it returns a good action in this domain. Then you give it another domain and a different utility function, and it returns a good action in this domain. Basically I'm saying that it doesn't maximize a single unified utility function.
2) You prove too much. This implies that the Unix cat program has a utility function (or else it is wasting effort). Technically you could view it as having a utility function of "1 if I output what the source code of cat outputs, 0 otherwi...
This one. It doesn't log data.