itaibn0 comments on Reply to Holden on 'Tool AI' - Less Wrong

94 Post author: Eliezer_Yudkowsky 12 June 2012 06:00PM




Comment author: HoldenKarnofsky 05 July 2012 04:18:16PM 18 points

Hello,

I appreciate the thoughtful response. I plan to respond at greater length in the future, both to this post and to some other content posted by SI representatives and commenters. For now, I wanted to take a shot at clarifying the discussion of "tool-AI" by discussing AIXI. One of the issues I've found with the debate over FAI in general is that I haven't seen much in the way of formal precision about the challenge of Friendliness (I recognize that I have also provided little formal precision, though I feel the burden of formalization is on SI here). It occurred to me that AIXI might provide a good opportunity to have a more precise discussion, if in fact it is believed to represent a case of "a rare exception who specified his AGI in such unambiguous mathematical terms that he actually succeeded at realizing, after some discussion with SIAI personnel, that AIXI would kill off its users and seize control of its reward button."

So here's my characterization of how one might work toward a safe and useful version of AIXI, using the "tool-AI" framework, if one could in fact develop an efficient enough approximation of AIXI to qualify as a powerful AGI. Of course, this is just a rough outline of what I have in mind, but hopefully it adds some clarity to the discussion.

A. Write a program that

  1. Computes an optimal policy, using some implementation of equation (20) on page 22 of http://www.hutter1.net/ai/aixigentle.pdf
  2. "Prints" the policy in a human-readable format (using some fixed algorithm for "printing" that is not driven by a utility function)
  3. Provides tools for answering user questions about the policy, i.e., "What will be its effect on _?" (using some fixed algorithm for answering user questions that makes use of AIXI's probability function, and is not driven by a utility function)
  4. Does not contain any procedures for "implementing" the policy, only for displaying it and its implications in human-readable form

B. Run the program; examine its output using the tools described above (#2 and #3); if, upon such examination, the policy appears potentially destructive, continue tweaking the program (for example, by tweaking the utility it is selecting a policy to maximize) until the policy appears safe and desirable

C. Implement the policy using tools other than the AIXI agent

D. Repeat (B) and (C) until one has confidence that the AIXI agent reliably produces safe and desirable policies, at which point more automation may be called for

My claim is that this approach would be superior to that of trying to develop "Friendliness theory" in advance of having any working AGI, because it would allow experiment- rather than theory-based development. Eliezer, I'm interested in your thoughts about my claim. Do you agree? If not, where is our disagreement?

Comment author: Eliezer_Yudkowsky 11 July 2012 10:59:07PM 29 points

Didn't see this at the time, sorry.

So... I'm sorry if this reply seems a little unhelpful, and I wish there was some way to engage more strongly, but...

Point (1) is the main problem. AIXI updates freely over a gigantic range of sensory predictors with no specified ontology - it's a sum over a huge set of programs, and we, the users, have no idea what the representations are talking about, except that at the end of their computations they predict, "You will see a sensory 1 (or a sensory 0)." (In my preferred formalism, the program puts a probability on a 0 instead.) Inside, the program could've been modeling the universe in terms of atoms, quarks, quantum fields, cellular automata, giant moving paperclips, slave agents scurrying around... we, the programmers, have no idea how AIXI is modeling the world and producing its predictions, and indeed, the final prediction could be a sum over many different representations.

This means that equation (20) in Hutter is written as a utility function over sense data, where the reward channel is just a special case of sense data. We can easily adapt this equation to talk about any function computed directly over sense data - we can get AIXI to optimize any aspect of its sense data that we please. We can't get it to optimize a quality of the external universe. One of the challenges I listed in my FAI Open Problems talk, and one of the problems I intend to talk about in my FAI Open Problems sequence, is to take the first nontrivial steps toward adapting this formalism - to e.g. take an equivalent of AIXI in a really simple universe, with a really simple goal, something along the lines of a Life universe and a goal of making gliders, and specify something given unlimited computing power which would behave like it had that goal, without pre-fixing the ontology of the causal representation to that of the real universe, i.e., you want something that can range freely over ontologies in its predictive algorithms, but which still behaves like it's maximizing an outside thing like gliders instead of a sensory channel like the reward channel. This is an unsolved problem!
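For reference, the AIXI action rule is usually quoted in roughly the following form. This is a sketch from the standard presentation; that it corresponds exactly to "equation (20)" in the linked PDF is an assumption, as is the exact notation ($U$ for the universal machine, $q$ for programs, $\ell(q)$ for program length, $m$ for the horizon).

```latex
a_k := \arg\max_{a_k} \sum_{o_k r_k} \cdots \max_{a_m} \sum_{o_m r_m}
       \bigl[\, r_k + \cdots + r_m \,\bigr]
       \sum_{q \,:\, U(q,\, a_1 \ldots a_m) \,=\, o_1 r_1 \ldots o_m r_m} 2^{-\ell(q)}
```

The bracketed term depends only on the percept stream (observations $o_i$ and rewards $r_i$). It could be swapped for another function of those percepts, but nothing in the expression refers to a state of the world outside the sensory channel, which is the limitation described above.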

We haven't even got to the part where it's difficult to say in formal terms how to interpret what a human says s/he wants the AI to plan, and where failures of phrasing of that utility function can also cause a superhuman intelligence to kill you. We haven't even got to the huge buried FAI problem inside the word "optimal" in point (1), which is the really difficult part in the whole thing. Because so far we're dealing with a formalism that can't even represent a purpose of the type you're looking for - it can only optimize over sense data, and this is not a coincidental fact, but rather a deep problem which the AIXI formalism deliberately avoided.

(2) sounds like you think an AI with an alien, superhuman planning algorithm can tell humans what to do without ever thinking consequentialistically about which different statements will result in human understanding or misunderstanding. Anna says that I need to work harder on not assuming other people are thinking silly things, but even so, when I look at this, it's hard not to imagine that you're modeling AIXI as a sort of spirit containing thoughts, whose thoughts could be exposed to the outside with a simple exposure-function. It's not unthinkable that a non-self-modifying superhuman planning Oracle could be developed with the further constraint that its thoughts are human-interpretable, or can be translated for human use without any algorithms that reason internally about what humans understand, but this would at the least be hard. And with AIXI it would be impossible, because AIXI's model of the world ranges over literally all possible ontologies and representations, and its plans are naked motor outputs.

Similar remarks apply to interpreting and answering "What will be its effect on _?" It turns out that getting an AI to understand human language is a very hard problem, and it may very well be that even though talking doesn't feel like having a utility function, our brains are using consequential reasoning to do it. Certainly, when I write language, that feels like I'm being deliberate. It's also worth noting that "What is the effect on X?" really means "What are the effects I care about on X?" and that there's a large understanding-the-human's-utility-function problem here. In particular, you don't want your language for describing "effects" to partition, as the same state of described affairs, any two states which humans assign widely different utilities. Let's say there are two plans for getting my grandmother out of a burning house, one of which destroys her music collection, one of which leaves it intact. Does the AI know that music is valuable? If not, will it not describe music-destruction as an "effect" of a plan which offers to free up large amounts of computer storage by, as it turns out, overwriting everyone's music collection? If you then say that the AI should describe changes to files in general, well, should it also talk about changes to its own internal files? Every action comes with a huge number of consequences - if we hear about all of them (reality described on a level so granular that it automatically captures all utility shifts, as well as a huge number of other unimportant things) then we'll be there forever.

I wish I had something more cooperative to say in reply - it feels like I'm committing some variant of logical rudeness by this reply - but the truth is, it seems to me that AIXI isn't a good basis for the agent you want to describe; and I don't know how to describe it formally myself, either.

Comment author: itaibn0 06 January 2014 11:43:20PM 0 points

I believe AIXI is much more inspectable than you make it out to be. I think it is important to challenge your claim here because Holden appears to have trusted your expertise and thereby conceded an important part of the argument.

AIXI's utility judgements are based on a Solomonoff prior, which is based on the computer programs which return the input data. Computer programs are not black boxes. A system implementing AIXI can easily also return a sample of typical expected future histories and the programs compressing these histories. By examining these programs, we can figure out what implicit model the AIXI system has of its world. These programs are optimized for shortness so they are likely to be very obfuscated, but I don't expect them to be incomprehensible (after all, they're not optimized for incomprehensibility). Even just sampling expected histories without their compressions is likely to be very informative. In the case of AIXItl the situation is better in the sense that its output at any given time is guaranteed to be generated by just one length <l subprogram, and this subprogram comes with a proof justifying its utility judgement. It's also worse in that there is no way to sample its expected future histories. However, I expect the proof provided would implicitly contain such information. If either the programs or the proofs cannot be understood by humans, the programmers can just reject them and look at the next best candidates.

As for "What will be its effect on _?", this can be answered as well. I already stated that with AIXI you can sample future histories. This is because AIXI has a specific known prior it implements for its future histories, namely Solomonoff induction. This ability may seem limited because it only shows the future sensory data, but sensory data can be whatever you feed AIXI as input. If you want it to have a realistic model of the world, this includes a lot of relevant information. For example, if you feed it the entire database of Wikipedia, it can give likely future versions of Wikipedia, which already provides a lot of detail on the effect of its actions.
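The inspection idea in this comment (list the programs that survive the data, then sample future histories from them) can be illustrated with a toy sketch. This is not AIXI: the four hand-written "programs", their assumed bit lengths, and the names `HYPOTHESES`, `posterior` and `sample_future` are all hypothetical stand-ins for an enumeration over all programs.

```python
import random
from fractions import Fraction

# Hypothetical toy "programs": (name, assumed length in bits, predictor).
# Each predictor maps a history (tuple of bits) to the next bit.
HYPOTHESES = [
    ("all_zeros", 2, lambda h: 0),
    ("all_ones", 2, lambda h: 1),
    ("alternate", 3, lambda h: len(h) % 2),
    ("repeat_001", 5, lambda h: 1 if len(h) % 3 == 2 else 0),
]


def consistent(predict, history):
    """True if the program would have produced every bit of the history."""
    return all(predict(history[:i]) == bit for i, bit in enumerate(history))


def posterior(history):
    """Weight each surviving program by 2^-length, then normalise."""
    weights = {
        name: Fraction(1, 2 ** length)
        for name, length, predict in HYPOTHESES
        if consistent(predict, history)
    }
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def sample_future(history, steps, rng):
    """Draw one surviving program from the posterior and roll it forward."""
    post = posterior(history)
    names = list(post)
    name = rng.choices(names, weights=[float(post[n]) for n in names])[0]
    predict = next(p for n, _, p in HYPOTHESES if n == name)
    future = list(history)
    for _ in range(steps):
        future.append(predict(tuple(future)))
    return name, future[len(history):]
```

Calling `posterior((0,))` returns the surviving programs with their weights, which is the "look at the programs" step; `sample_future((0, 1), 4, random.Random(0))` returns one program name together with the continuation it predicts. Whether the short programs a real Solomonoff mixture favours would be as readable as these is exactly the point under dispute.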

Comment author: Nebu 17 February 2016 11:15:09AM 1 point

Can you be a bit more specific in your interpretation of AIXI here?

Here are my assumptions, let me know where you have different assumptions:

  • Traditional-AIXI is assumed to exist in the same universe as the human who wants to use AIXI to solve some problem.
  • Traditional-AIXI has a fixed input channel (e.g. it's connected to a webcam, and/or it receives keyboard signals from the human, etc.)
  • Traditional-AIXI has a fixed output channel (e.g. it's connected to an LCD monitor, or it can control a robot servo arm, or whatever).
  • The human has somehow pre-provided Traditional-AIXI with some utility function.
  • Traditional-AIXI operates in discrete time steps.
  • In the first timestep that elapses since Traditional-AIXI is activated, Traditional-AIXI examines the input it receives. It considers all possible programs that take a pair (S, A) and emit an output P, where S is the prior state, A is an action to take, and P is the predicted output of taking the action A in state S. Then it discards all programs that would not have produced the input it received, regardless of what S or A it was given. Then it weighs the remaining programs according to their Kolmogorov complexity. This is basically the Solomonoff induction step.
  • Now Traditional-AIXI has to make a decision about an output to generate. It considers all possible outputs it could produce, and feeds each to the programs under consideration, to produce a predicted next time step. Traditional-AIXI then calculates the expected utility of each output (using its pre-programmed utility function), picks the one with the highest utility, and emits that output. Note that it has no idea how any of its outputs would affect the universe, so this is essentially a uniformly random choice.
  • In the next timestep, Traditional-AIXI reads its inputs again, but this time taking into account what output it has generated in the previous step. It can now start to model correlation, and eventually causation, between its input and outputs. It has a previous state S and it knows what action A it took in its last step. It can further discard more programs, and narrow the possible models that describe the universe it finds itself in.
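The loop described in the bullets above can be sketched as a toy, one-step-lookahead agent. This is only an illustration under strong assumptions: the three world models, their bit lengths, the `utility` function and every name here are hypothetical, and real AIXI plans over a whole horizon and over all programs rather than a hand-picked list.

```python
from fractions import Fraction

# Hypothetical toy world models: (name, assumed length in bits, model),
# where model(state, action) returns the predicted next percept.
MODELS = [
    ("echo_action", 3, lambda s, a: a),
    ("invert_action", 4, lambda s, a: 1 - a),
    ("ignore_action", 2, lambda s, a: s),
]
ACTIONS = (0, 1)


def utility(percept):
    """Assumed pre-provided utility: the agent simply prefers percept 1."""
    return percept


def surviving_weights(experience):
    """Drop models that mispredict any (state, action, percept) triple
    seen so far; weight the survivors by 2^-length."""
    return {
        name: Fraction(1, 2 ** length)
        for name, length, model in MODELS
        if all(model(s, a) == p for s, a, p in experience)
    }


def expected_utility(state, action, experience):
    """Posterior-weighted utility of the percept each survivor predicts."""
    weights = surviving_weights(experience)
    total = sum(weights.values())
    score = sum(
        weights[name] * utility(model(state, action))
        for name, _, model in MODELS
        if name in weights
    )
    return score / total


def choose_action(state, experience):
    """Pick the action with the highest expected utility (one step ahead)."""
    return max(ACTIONS, key=lambda a: expected_utility(state, a, experience))
```

With no experience, `choose_action(0, [])` is decided by the length prior alone, since no model has been ruled out yet. After one observed triple such as `(0, 1, 0)`, the model that mispredicted it is discarded and the preferred action can change, which is the narrowing step the last bullet describes.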

How does Tool-AIXI work in contrast to this? Holden seems to want to avoid having any utility function pre-defined at all. However, presumably Tool-AIXI still receives inputs and still produces outputs (probably Holden intends not to allow Tool-AIXI to control a robot servo arm, but he might intend for Tool-AIXI to be able to control an LCD monitor, or at the very least, produce some sort of text file as output).

Does Tool-AIXI proceed in discrete time steps gathering input? Or do we prevent Tool-AIXI from running until a user is ready to submit a curated input to Tool-AIXI? If the latter, how quickly do we expect Tool-AIXI to be able to formulate a reasonable model of our universe?

How does Tool-AIXI choose what output to produce, if there's no utility function?

If we type in "Tool-AIXI, please give me a cure for cancer" onto a keyboard attached to Tool-AIXI and submit that as an input, do we think that a model that encodes ASCII, the English language, bio-organisms, etc. has a lower Kolmogorov complexity than a model that says "we live in a universe where we receive exactly this hardcoded stream of bytes"?

Does Tool-AIXI model the output it produces (whether that be pixels on a screen, or bytes to a file) as an action, or does it somehow prevent itself from modelling its output as if it were an action that had some effect on the universe that it exists in? If the former, then isn't this just an agenty Oracle AI? If the latter, then what kind of programs does it generate for its model (surely not programs that take (S, A) pairs as inputs, or else what would it use for A when evaluating its plans and predicting the future)?