Value learners & wireheading
Dewey 2011 lays out the rules for one kind of agent with a mutable value system. The agent has some distribution over utility functions, which it has rules for updating based on its interaction history (where "interaction history" means the agent's observations and actions since its origin). To choose an action, it looks through every possible future interaction history, and picks the action that leads to the highest expected utility, weighted both by the possibility of making that future happen and the utility function distribution that would hold if that future came to pass.
We might motivate this sort of update strategy by considering a sandwich-drone bringing you a sandwich. The drone can either go to your workplace, or go to your home. If we think about this drone as a value-learner, then the "correct utility function" depends on whether you're at work or at home - upon learning your location, the drone should update its utility function so that it wants to go to that place. (Value learning is unnecessarily indirect in this case, but that's because it's a simple example.)
Suppose the drone begins its delivery assigning equal measure to the home-utility-function and to the work-utility-function (i.e. ignorant of your location), and can learn your location for a small cost. If the drone evaluated this idea with its current utility function, it wouldn't see any benefit, even though it would in fact deliver the sandwich properly - because under its current utility function there's no point to going to one place rather than the other. To get sensible behavior, and properly deliver your sandwich, the drone must evaluate actions based on what utility function it will have in the future, after the action happens.
If you're familiar with how wireheading or quantum suicide look in terms of decision theory, this method of deciding based on future utility functions might seem risky. Fortunately, value learning doesn't permit wireheading in the traditional sense, because the updates to the utility function are an abstract process, not a physical one. The agent's probability distribution over utility functions, which is conditional on interaction histories, defines which actions and observations are allowed to change the utility function during the process of predicting expected utility.
Dewey also mentions that so long as the probability distribution over utility functions is well-behaved, you cannot deliberately take action to raise the probability of one of the utility functions being true. But I think this is only useful to safety when we understand and trust the overarching utility function that gets evaluated at the future time horizon. If instead we start at the present, and specify a starting utility function and rules for updating it based on observations, this complex system can evolve in surprising directions, including some wireheading-esque behavior.
The formalism of Dewey 2011 is, at bottom, extremely simple. I'm going to be a bad pedagogue here: I think this might only make sense if you go look at equations 2 and 3 in the paper, and figure out what all the terms do, and see how similar they are. The cheap summary is that if your utility is a function of the interaction history, trying to change utility functions based on interaction history just gives you back a utility function. If we try to think about what sort of process to use to change an agent's utility function, this formalism provides only one tool: look out to some future time horizon, and define an effective utility function in terms of what utility functions are possible at that future time horizon. This is different from the approximations or local utility functions we would like in practice.
If we take this scheme and try to approximate it, for example by only looking N steps into the future, we run into problems; the agent will want to self-modify so that next timestep it only looks ahead N-1 steps, and then N-2 steps, and so on. Or more generally, many simple approximation schemes are "sticky" - from inside the approximation, an approximation that changes over time looks like undesirable value drift.
Common sense says this sort of self-sabotage should be eliminable. One should be able to really care about the underlying utility function, not just its approximation. However, this problem tends to crop up, for example whenever the part of the future you look at does not depend on which action you are considering; modifying to keep looking at the same part of the future unsurprisingly improve the results you get in that part of the future. If we want to build a paperclip maximizer, it shouldn't be necessary to figure out every single way to self-modify and penalize them appropriately.
We might evade this particular problem using some other method of approximation that does something more like reasoning about actions than reasoning about futures. The reasoning doesn't have to be logically impeccable - we might imagine an agent that identifies a small number of salient consequences of each action, and chooses based on those. But it seems difficult to show how such an agent would have good properties. This is something I'm definitely interested in.
One way to try to make things concrete is to pick a local utility function and specify rules for changing it. For example, suppose we wanted an AI to flag all the 9s in the MNIST dataset. We define a single-time-step utility function by a neural network that takes in the image and the decision of whether to flag or not, and returns a number between -1 and 1. This neural network is deterministically trained for each time step on all previous examples, trying to assign 1 to correct flaggings and -1 to mistakes. Remember, this neural net is just a local utility function - we can make a variety of AI designs involving it. The goal of this exercise is to design an AI that seems liable to make good decisions in order to flag lots of 9s.
The simplest example is the greedy agent - it just does whatever has a high score right now. This is pretty straightforward, and doesn't wirehead (unless the scoring function somehow encodes wireheading), but it doesn't actually do any planning - 100% of the smarts have to be in the local evaluation, which is really difficult to make work well. This approach seems unlikely to extend well to messy environments.
Since Go-playing AI is topical right now, I shall digress. Successful Go programs can't get by with only smart evaluations of the current state of the board, they need to look ahead to future states. But they also can't look all the way until the ultimate time horizon, so they only look a moderate way into the future, and evaluate that future state of the board using a complicated method that tries to capture things important to planning. In sufficiently clever and self-aware agents, this approximation would cause self-sabotage to pop up. Even if the Go-playing AI couldn't modify itself to only care about the current way it computes values of actions, it might make suboptimal moves that limit its future options, because its future self will compute values of actions the 'wrong' way.
If we wanted to flag 9s using a Dewian value learner, we might score actions according to how good they will be according to the projected utility function at some future time step. If this is done straightforwardly, there's a wireheading risk - the changes to its utility function are supplied by humans who might be influenced by actions. I find it useful to apply a sort of "magic button" test - if the AI had a magic button that could rewrite human brains, would it pressing that button have positive expected utility for it? If yes, then this design has problems, even though in our current thought experiment it's just flagging pictures.
To eliminate wireheading, the value learner can use a model of the future inputs and outputs and the probability of different value updates given various inputs and outputs, which doesn't model ways that actions could influence the utility updates. This model doesn't have to be right, it just has to exist. On one hand, this seems like a sort of weird doublethink, to judge based on a counterfactual where your actions don't have impacts you could otherwise expect. On the other hand, it also bears some resemblance to how we actually reason about moral information. Regardless, this agent will now not wirehead, and will want to get good results by learning about the world, if only in the very narrow sense of wanting to play unscored rounds that update its value function. If its value function and value updating made better use of unlabeled data, it would also want to learn about the world in the broader sense.
Overall I am somewhat frustrated, because value learners have these nice properties, but are computationally unrealistic and do not play well with approximation. One can try to get the nice properties elsewhere, such as relying on an action-suggester to not suggest wireheading, but it would be nice to be able to talk about this as an approximation to something fancier.
Communicating concepts in value learning
Epistemic status: Trying to air out some thoughts for feedback, we'll see how successfully. May require some machine learning to make sense, and may require my level of ignorance to seem interesting.
Many current proposals for value learning are garden-variety regression (or its close cousin, classification). The agent doing the learning starts out with some model for what human values look like (a utility function over states of the world, or a reward function in a Markov decision process, or an expected utility function over possible actions), and receives training data that tells it the right thing to do in a lot of different situations. And so the agent finds the parameters of the model that minimize some loss function with the data, and Learns Human Values.
All these models of "the right thing to do" I mentioned are called parametric models, because they have some finite template that they update based on the data. Non-parametric models, on the other hand, have to keep a record of the data they've seen - prediction with a non-parametric model often looks like taking some weighted average of nearby known examples (though not always), while a parametric model would (often) fit some curve to the data and predict using that. But we'll get back to this later.
An obvious problem with current proposals is that it's very resource-intensive to communicate a category or concept to the agent. An AI might be able to automatically learn a lot about the world, but if we want to define its preferences, we have to somehow pick out the concept of "good stuff" within the representation of the world learned by the AI. Current proposals for this look like supervised learning, where huge amounts of labeled data are needed to specify "good stuff," and for many proposals I'm concerned that we'll actually end up specifying "stuff that humans can be convinced is good," which is not at all the same. Humans are much better learners than these supervised learning systems - they learn from fewer examples, and have a better grasp of the meaning and structure behind examples. This hints that there are some big improvements to be made in value learning.
This comparison to humans also leads to my vaguer concerns. It seems like the labeled examples are too crucial, and the unlabeled data not crucial enough. We want a value learner to understand concepts based on just a few examples so long as it has unlabeled data to fill in the gaps, and be able to learn more about morality from observation as a core competency, not as a pale shadow of its learning from labeled data. It seems like fine-tuning the model for the labeled data with stochastic gradient descent is missing something important.
To digress slightly, there are additional problems (e.g. corrigibility) once you build an agent that has an output channel instead of merely sponging up information, and these problems are harder if we want value learning from observation. If we want a value learning agent that could learn a simplified version of human morality, and then use that to learn the full version, we might need something like the Bayesian guarantee of Dewey 2011, or a functional analogue thereof.
One inspiration for alternative learning schemes might be clustering. As a toy example, imagine finding literal clusters in thing-space by k-means clustering. If you want to specify a cluster, you can do something like pick a small sample of examples and force them to be in the same cluster, and allow the number of clusters you try to find in the data to vary so that the statistics of the mandatory cluster are not very different from any other's. The huge problem here is that the idea of "thing-space" elides the difficulty of learning a representation of the world (or equivalently, elides how really, really complicated the cluster boundaries are in terms of observations).
Because learning how to understand the world already requires you to be really good at learning things, it's not obvious to me what identifying and using clusters in the data will entail. One might imagine that if we modeled the world using a big pile of autoencoders, this pile would already contain predictors for many concepts we might want to specify, but that if we use examples to try and communicate a concept that was not already learned, the pile might not even contain the features that make our concept easy to specify. Further speculation in this vein is fun, but is likely pointless at my current level of understanding. So even though learning well from unlabeled data is an important desideratum, I'm including this digression on clustering because I think it's interesting, not because I've shed much light.
Okay, returning to the parametric/non-parametric thing. The problem of being bad at learning from unlabeled data shows up in diverse proposals like inverse reinforcement learning and Hibbard 2012's two-part example. And in these cases it's not due to the learning algorithm per se, but for the simple reason that at some point the representation of the world is treated as fixed - the value learner is assumed to understand the world, and then proceeds to learn or be told human values in terms of that understanding. If you can no longer update your understanding of the world, naturally this causes problems with learning from observation.
We should instead design agents that are able to keep learning about the world. And this brings us back to the idea of communicating concepts via examples. The most reasonable way to update learned concepts in light of new information seems to be to just store the examples and re-apply them to the new understanding. This would be a non-parametric model of learned concepts.
What concepts to learn and how to use them to make decisions is not at all known to me, but as a placeholder we might consider the task of learning to identify "good actions," given proposed actions and some input about the world (similar to the "Learning from examples" section of Christiano's Approval Directed Agents).
Meetup : Urbana-Champaign: Quorum for discourse
Discussion article for the meetup : Urbana-Champaign: Quorum for discourse
Another year, another chance to come to a LW meetup. Find us at the scenic north entrance of Altgeld Hall. I'll bring delicious food. Depending on what people want, discussion may vary in technical level. Math is good, but sometimes fun digressions are fine too.
Discussion article for the meetup : Urbana-Champaign: Quorum for discourse
Moral AI: Options
Epistemic status: One part quotes (informative, accurate), one part speculation (not so accurate).
One avenue towards AI safety is the construction of "moral AI" that is good at solving the problem of human preferences and values. Five FLI grants have recently been funded that pursue different lines of research on this problem.
The projects, in alphabetical order:
Most contemporary AI systems base their decisions solely on consequences, whereas humans also consider other morally relevant factors, including rights (such as privacy), roles (such as in families), past actions (such as promises), motives and intentions, and so on. Our goal is to build these additional morally relevant features into an AI system. We will identify morally relevant features by reviewing theories in moral philosophy, conducting surveys in moral psychology, and using machine learning to locate factors that affect human moral judgments. We will use and extend game theory and social choice theory to determine how to make these features more precise, how to weigh conflicting features against each other, and how to build these features into an AI system. We hope that eventually this work will lead to highly advanced AI systems that are capable of making moral judgments and acting on them.
Techniques: Top-down design, game theory, moral philosophy
Previous work in economics and AI has developed mathematical models of preferences, along with algorithms for inferring preferences from observed actions. [Citation of inverse reinforcement learning] We would like to use such algorithms to enable AI systems to learn human preferences from observed actions. However, these algorithms typically assume that agents take actions that maximize expected utility given their preferences. This assumption of optimality is false for humans in real-world domains. Optimal sequential planning is intractable in complex environments and humans perform very rough approximations. Humans often don't know the causal structure of their environment (in contrast to MDP models). Humans are also subject to dynamic inconsistencies, as observed in procrastination, addiction and in impulsive behavior. Our project seeks to develop algorithms that learn human preferences from data despite the suboptimality of humans and the behavioral biases that influence human choice. We will test our algorithms on real-world data and compare their inferences to people's own judgments about their preferences. We will also investigate the theoretical question of whether this approach could enable an AI to learn the entirety of human values.
Techniques: Trying to find something better than inverse reinforcement learning, supervised learning from preference judgments
The future will see autonomous agents acting in the same environment as humans, in areas as diverse as driving, assistive technology, and health care. In this scenario, collective decision making will be the norm. We will study the embedding of safety constraints, moral values, and ethical principles in agents, within the context of hybrid human/agents collective decision making. We will do that by adapting current logic-based modelling and reasoning frameworks, such as soft constraints, CP-nets, and constraint-based scheduling under uncertainty. For ethical principles, we will use constraints specifying the basic ethical ``laws'', plus sophisticated prioritised and possibly context-dependent constraints over possible actions, equipped with a conflict resolution engine. To avoid reckless behavior in the face of uncertainty, we will bound the risk of violating these ethical laws. We will also replace preference aggregation with an appropriately developed constraint/value/ethics/preference fusion, an operation designed to ensure that agents' preferences are consistent with the system's safety constraints, the agents' moral values, and the ethical principles of both individual agents and the collective decision making system. We will also develop approaches to learn ethical principles for artificial intelligent agents, as well as predict possible ethical violations.
Techniques: Top-down design, obeying ethical principles/laws, learning ethical principles
The objectives of the proposed research are (1) to create a mathematical framework in which fundamental questions of value alignment can be investigated; (2) to develop and experiment with methods for aligning the values of a machine (whether explicitly or implicitly represented) with those of humans; (3) to understand the relationships among the degree of value alignment, the decision-making capability of the machine, and the potential loss to the human; and (4) to understand in particular the implications of the computational limitations of humans and machines for value alignment. The core of our technical approach will be a cooperative, game-theoretic extension of inverse reinforcement learning, allowing for the different action spaces of humans and machines and the varying motivations of humans; the concepts of rational metareasoning and bounded optimality will inform our investigation of the effects of computational limitations.
Techniques: Trying to find something better than inverse reinforcement learning (differently this time), creating a mathematical framework, whatever rational metareasoning is
Autonomous AI systems will need to understand human values in order to respect them. This requires having similar concepts as humans do. We will research whether AI systems can be made to learn their concepts in the same way as humans learn theirs. Both human concepts and the representations of deep learning models seem to involve a hierarchical structure, among other similarities. For this reason, we will attempt to apply existing deep learning methodologies for learning what we call moral concepts, concepts through which moral values are defined. In addition, we will investigate the extent to which reinforcement learning affects the development of our concepts and values.
Techniques: Trying to identify learned moral concepts, unsupervised learning
The elephant in the room is that making judgments that always respect human preferences is nearly FAI-complete. Application of human ethics is dependent on human preferences in general, which are dependent on a model of the world and how actions impact it. Calling an action ethical also can also depend on the space of possible actions, requiring a good judgment-maker to be capable of search for good actions. Any "moral AI" we build with our current understanding is going to have to be limited and/or unsatisfactory.
Limitations might be things like judging which of two actions is "more correct" rather than finding correct actions, only taking input in terms of one paragraph-worth of words, or only producing good outputs for situations similar to some combination of trained situations.
Two of the proposals are centered on top-down construction of a system for making ethical judgments. Designing a system by hand, it's nigh-impossible to capture the subtleties of human values. Relatedly, it seems weak at generalization to novel situations, unless the specific sort of generalization has been forseen and covered. The good points of a top down approach are that it can capture things that are important, but are only a small part of the description, or are not easily identified by statistical properties. A top-down model of ethics might be used as a fail-safe, sometimes noticing when something undesirable is happening, or as a starting point for a richer learned model of human preferences.
Other proposals are inspired by inverse reinforcement learning. Inverse reinforcement learning seems like the sort of thing we want - it observes actions and infers preferences - but it's very limited. The problem of having to know a very good model of the world in order to be good at human preferences rears its head here. There are also likely unforseen technical problems in ensuring that the thing it learns is actually human preferences (rather than human foibles, or irrelevant patterns) - though this is, in part, why this research should be carried out now.
Some proposals want to take advantage of learning using neural networks, trained on peoples' actions or judgments. This sort of approach is very good at discovering patterns, but not so good at treating patterns as a consequence of underlying structure. Such a learner might be useful as a heuristic, or as a way to fill in a more complicated, specialized architecture. For this approach like the others, it seems important to make the most progress toward learning human values in a way that doesn't require a very good model of the world.
Limited agents need approximate induction
[This post borders on some well-trodden ground in information theory and machine learning, so ideas in this post have an above-average chance of having already been stated elsewhere, by professionals, better. EDIT: As it turns out, this is largely the case, under the subjects of the justifications for MML prediction and Kolmogorov-simple PAC-learning.]
I: Introduction
I am fascinated by methods of thinking that work for well-understood reasons - that follow the steps of a mathematically elegant dance. If one has infinite computing power the method of choice is something like Solomonoff induction, which is provably ideal in a certain way at predicting the world. But if you have limited computing power, the choreography is harder to find.
To do Solomonoff induction, you search through all Turing machine hypotheses to find the ones that exactly output your data so far, then use the weighted average of those perfect retrodictors to predict the next time step. So the naivest way to build an ideal limited agent is to merely search through lots of hypotheses (chosen from some simple set) rather than all of them, and only run each Turing machine for time less than some limit. At least it's guaranteed to work in the limit of large computing power, which ain't nothing.
Suppose then that we take this nice elegant algorithm for a general predictor, and we implement it on today's largest supercomputer, and we show it the stock market prices from the last 50 years to try to predict stocks and get very rich. What happens?
Bupkis happens, that's what. Our Solomonoff predictor tries a whole lot of Turing machines and then runs out of time before finding any useful hypotheses that can perfectly replicate 50 years of stock prices. This is because such useful hypotheses are very, very, very rare.
We might then turn to the burgeoning field of logical uncertainty, which has a major goal of handling intractable math problems in an elegant and timely manner. We are logically uncertain about what distribution Solomonoff induction will output, so can we just average over that logical uncertainty to get some expected stock prices?
The trouble with this is that current logical uncertainty methods rely on proofs that certain outputs are impossible or contradictory. For simple questions this can narrow down the answers, but for complicated problems it becomes intractable, replacing the hard problem of evaluating lots of Turing machines with the hard problem of searching through lots and lots of proofs about lots of Turing machines - and so again our predictor runs out of time before becoming useful.
In practice, the methods we've found to work don't look very much like Solomonoff induction. Successful methods don't take the data as-is, but instead throw some of it away: curve fitting and smoothing data, filtering out hard-to-understand signals as noise, and using predictive models that approximate reality imperfectly. The sorts of things that people trying to predict stocks are already doing. These methods are vital to improve computational tractability, but are difficult (to my knowledge) to fit into a framework as general as Solomonoff induction.
II: Rambling
Suppose that our AI builds a lot of models of the world, including lossy models. How should it decide which models are best to use for predicting the world? Ideally we'd like to make a tradeoff between the accuracy of the model, measured in the expected utility of how accurate you expect the model's predictions to be, and the cost of the time and energy used to make the prediction.
Once we know how to tell good models, the last piece would be for our agent to make the explore/exploit tradeoff between searching for better models and using its current best.
There are various techniques to estimate resource usage, but how does one estimate accuracy?
Here was my first thought: If you know how much information you're losing (e.g. by binning data), for discrete distributions this sets the Shannon information of the ideal value (given by Solomonoff prediction) given the predicted value. This uses the relationship between information in bits of data and Shannon information that determines how sharp your probability distribution is allowed to be.
But with no guarantees about the normality (or similar niceness properties) of the ideal value given the prediction, this isn't very helpful. The problem is highlighted by hurricane prediction. If hurricanes behaved nicely as we threw away information, weather models would just be small, high-entropy deviations from reality. Instead, hurricanes can change route greatly even with small differences in initial conditions.
The failure of the above approach can be explained in a very general way: it uses too little information about the model and the data, only the amount of information thrown away. To do better, our agent has to learn a lot from its training data - a subject that workers in AI have already been hard at work on. On the one hand, it's a great sign if we can eventually connect ideal agents to current successful algorithms. On the other, doing so elegantly seems like a hard problem.
To sum up in the blandest possible way: If we want to build successful predictors of the future with limited resources, they should use their experience to learn approximate models of the world.
The real trick, though, is going to be to set this on a solid foundation. What makes a successful method of picking models? As we lack access to the future (yet! Growth mindset!), we can't grade models based on their future predictions unless we descend to solipsism and grade models against models. Thus we're left with grading models based on how well they retrodict the data so far. Sound familiar? The foundation we want seems like an analogue to Solomonoff induction, one that works for known reasons but doesn't require perfection.
III: An Example
Here's a paradigm that might or might not be a step in the right direction, but at least gestures at what I mean.
The first piece of the puzzle is that a model that gets proportion P of training bits wrong can be converted to a Solomonoff-accepted perfectly-precise model just by specifying the bits it gets wrong. Suppose we break the model output (with total length N) into chunks of size L, and prefix each chunk with the locations of the wrong bits in that chunk. Then the extra data required to rectify an approximate model is at most N/L·log(P·L)+N·P·log(L). Then the hypothesis where the model is right about the next bit is simpler than the hypothesis when it's wrong, because when the model is right you don't have to spend ~log(L) bits correcting it.
In this way, Solomonoff induction natively cares about some approximate models' predictions. There are some interesting details here that are outside the focus of this particular post. Does using the optimal chunk length lead to Solomonoff induction reflecting model accuracy correctly? What are some better schemes for rectifying models that handle things like models that output probabilities? The point is just that even if your model is wrong on fraction P of the training data, Solomonoff induction will still promote it as long as it's simpler than N-N/L·log(P·L)-N·P·log(L).
The second piece of the puzzle is that induction can be done over processed functions of observations, like smoothing the data or filtering difficult-to-predict parts (noise) out. If this processing increases the accuracy of models, we can use this to make high-accuracy models of functions the training data, and then use those models to predict the the processed future observations as above.
These two pieces allow an agent to use approximate models, and to throw away some of its information, and still have its predictions work for the same reason as Solomonoff induction. We can use this paradigm to interpret what an algorithm like curve fitting is doing - the fitted curve is a high-accuracy retrodiction of some smoothed function of the data, which therefore does a good job of predicting what that smoothed function will be in the future.
There are some issues here. If a model that you are using is not the simplest, it might have overfitting problems (though perhaps you can fix this just by throwing away more information than naively appears necessary) or systematic bias. More generally, we haven't explored how models get chosen; we've made the problem easier to brute force but we need to understand non-brute force search methods and what their foundations are. It's a useful habit to keep in mind what actually works for humans - as someone put it to me recently, "humans can make models they understand that work for reasons they understand."
Furthermore, this doesn't seem to capture reductionism well. If our agent learns some laws of physics and then is faced with a big complicated situation it needs to use a simplified model to make a prediction about, it should still in some sense "believe in the laws of physics," and not believe that this complicated situation violates the laws physics even if its current best model is independent of physics.
IV: Logical Uncertainty
It may be possible to relate this back to logical uncertainty - where by "this" I mean the general thesis of predicting the future by building models that are allowed to be imperfect, not the specific example in part III. Soares and Fallenstein use the example of a complex Rube Goldberg machine that deposits a ball into one of several chutes. Given the design of the machine and the laws of physics, suppose that one can in principle predict the output of this machine, but that the problem is much too hard for our computer to do. So rather than having a deterministic method that outputs the right answer, a "logical uncertainty method" in this problem is one that, with a reasonable amount of resources spent, takes in the description of the machine and the laws of physics, and gives a probability distribution over the machine's outputs.
Meanwhile, suppose that we take an approximately inductive predictor and somehow teach it the the laws of physics, then ask it to predict the machine. We'd like it to make predictions via some appropriately simplified folk model of physics. If this model gives a probability distribution over outcomes - like in the simple case of "if you flip this coin in this exact way, it has a 50% shot at landing heads" - doesn't that make it a logical uncertainty method? But note that the probability distribution returned by a single model is not actually the uncertainty introduced by replacing an ideal predictor with a resource-limited predictor. So any measurement of logical uncertainty has to factor in the uncertainty between models, not just the uncertainty within models.
Again, we're back to looking for some prediction method that weights models with some goodness metric more forgiving than just using perfectly-retrodicting Turing machines, and which outputs a probability distribution that includes model uncertainty. But can we apply this to mathematical questions, and not just Rube Goldberg machines? Is there some way to subtract away the machine and leave the math?
Suppose that our approximate predictor was fed math problems and solutions, and built simple, tractable programs to explain its observations. For easy math problems a successful model can just be a Turing machine that finds the right answer. As the math problems get more intractable, successful models will start to become logical uncertainty methods, like how we can't predict a large prime number exactly, but we can predict it's last digit is 1, 3, 7, or 9. Within this realm we have something like low-level reductionism, where even though we can't find a proof of the right answer, we still want to act as if mathematical proofs work and all else is ignorance, and this will help us make successful predictions.
Then we have complicated problems that seem to be beyond this realm, like P=NP. Humans certainly seem to have generated some strong opinions about P=NP without dependence on mathematical proofs narrowing down the options. It seems to such humans that the genuinely right procedure to follow is that, since we've searched long and hard for a fast algorithm for NP-complete problems without success, we should update in the direction that no such algorithm exists. In approximate-Solomonoff-speak, it's that P!=NP is consistent with a simple, tractable explanation for (a recognizable subset of) our observations, while P=NP is only consistent with more complicated tractable explanations. We could absolutely make a predictor that reasons this way - it just sets a few degrees of freedom. But is it the right way to reason?
For one thing, this seems like it's following Gaifman's proposed property of logical uncertainty, that seeing enough examples of something should convince you of it with probability 1 - which has been shown to be "too strong" in some sense (it assigns probability 0 to some true statements - though even this could be okay if those statements are infinitely dilute). Does the most straightforward implementation actually have the Gaifman condition, or not? (I'm sorry, ma'am. Your daughter has... the Gaifman condition.)
This inductive view of logical uncertainty lacks the consistent nature of many other approaches - if it works, it does so by changing approaches to suit the problem at hand. This is bad if you want your logical uncertainty methods to be based on a simple prior followed by some kind of updating procedure. But logical uncertainty is supposed to be practical, after all, and at least this is a simple meta-procedure.
V: Questions
Thanks for reading this post. In conclusion, here are some of my questions:
What's the role of Solomonoff induction in approximate induction? Is Solomonoff induction doing all of the work, or is it possible to make useful predictions using tractable hypotheses Solomonoff induction would exclude, or excluding intractable hypotheses Solomonoff induction would have to include?
Somehow we have to pick out models to promote to attention in the first place. What properties make a process for this good or bad? What methods for picking models can be shown to still lead to making useful predictions - and not merely in the limit of lots of computing time?
Are humans doing the right thing by making models they understand that work for reasons they understand? What's up with that reductionism problem anyhow?
Is it possible to formalize the predictor discussed in the context of logical uncertainty? Does it have to fulfill Gaifman's condition if it finds patterns in things like P!=NP?
Selfish preferences and self-modification
One question I've had recently is "Are agents acting on selfish preferences doomed to having conflicts with other versions of themselves?" A major motivation of TDT and UDT was the ability to just do the right thing without having to be tied up with precommitments made by your past self - and to trust that your future self would just do the right thing, without you having to tie them up with precommitments. Is this an impossible dream in anthropic problems?
In my recent post, I talked about preferences where "if you are one of two copies and I give the other copy a candy bar, your selfish desires for eating candy are unfulfilled." If you would buy a candy bar for a dollar but not buy your copy a candy bar, this is exactly a case of strategy ranking depending on indexical information.
This dependence on indexical information is inequivalent with UDT, and thus incompatible with peace and harmony.
To be thorough, consider an experiment where I am forked into two copies, A and B. Both have a button in front of them, and 10 candies in their account. If A presses the button, it deducts 1 candy from A. But if B presses the button, it removes 1 candy from B and gives 5 candies to A.
Before the experiment begins, I want my descendants to press the button 10 times (assuming candies come in units such that my utility is linear). In fact, after the copies wake up but before they know which is which, they want to press the button!
The model of selfish preferences that is not UDT-compatible looks like this: once A and B know who is who, A wants B to press the button but B doesn't want to do it. And so earlier, I should try and make precommitments to force B to press the button.
But suppose that we simply decided to use a different model. A model of peace and harmony and, like, free love, where I just maximize the average (or total, if we specify an arbitrary zero point) amount of utility that myselves have. And so B just presses the button.
(It's like non-UDT selfish copies can make all Pareto improvements, but not all average improvements)
Is the peace-and-love model still a selfish preference? It sure seems different from the every-copy-for-themself algorithm. But on the other hand, I'm doing it for myself, in a sense.
And at least this way I don't have to waste time with precomittment. In fact, self-modifying to this form of preferences is such an effective action that conflicting preferences are self-destructive. If I have selfish preferences now but I want my copies to cooperate in the future, I'll try to become an agent who values copies of myself - so long as they date from after the time of my self-modification.
If you recall, I made an argument in favor of averaging the utility of future causal descendants when calculating expected utility, based on this being the fixed point of selfish preferences under modification when confronted with Jan's tropical paradise. But if selfish preferences are unstable under self-modification in a more intrinsic way, this rather goes out the window.
Right now I think of selfish values as a somewhat anything-goes space occupied by non-self-modified agents like me and you. But it feels uncertain. On the mutant third hand, what sort of arguments would convince me that the peace-and-love model actually captures my selfish preferences?
Treating anthropic selfish preferences as an extension of TDT
I
When preferences are selfless, anthropic problems are easily solved by a change of perspective. For example, if we do a Sleeping Beauty experiment for charity, all Sleeping Beauty has to do is follow the strategy that, from the charity's perspective, gets them the most money. This turns out to be an easy problem to solve, because the answer doesn't depend on Sleeping Beauty's subjective perception.
But selfish preferences - like being at a comfortable temperature, eating a candy bar, or going skydiving - are trickier, because they do rely on the agent's subjective experience. This trickiness really shines through when there are actions that can change the number of copies. For recent posts about these sorts of situations, see Pallas' sim game and Jan_Ryzmkowski's tropical paradise. I'm going to propose a model that makes answering these sorts of questions almost as easy as playing for charity.
To quote Jan's problem:
It's a cold cold winter. Radiators are hardly working, but it's not why you're sitting so anxiously in your chair. The real reason is that tomorrow is your assigned upload, and you just can't wait to leave your corporality behind. "Oh, I'm so sick of having a body, especially now. I'm freezing!" you think to yourself, "I wish I were already uploaded and could just pop myself off to a tropical island."
And now it strikes you. It's a weird solution, but it feels so appealing. You make a solemn oath (you'd say one in million chance you'd break it), that soon after upload you will simulate this exact scene a thousand times simultaneously and when the clock strikes 11 AM, you're gonna be transposed to a Hawaiian beach, with a fancy drink in your hand.
It's 10:59 on the clock. What's the probability that you'd be in a tropical paradise in one minute?
Just as computer programs or brains can split, they ought to be able to merge. If we imagine a version of the Ebborian species that computes digitally, so that the brains remain synchronized so long as they go on getting the same sensory inputs, then we ought to be able to put two brains back together along the thickness, after dividing them. In the case of computer programs, we should be able to perform an operation where we compare each two bits in the program, and if they are the same, copy them, and if they are different, delete the whole program. (This seems to establish an equal causal dependency of the final program on the two original programs that went into it. E.g., if you test the causal dependency via counterfactuals, then disturbing any bit of the two originals, results in the final program being completely different (namely deleted).)
How many people am I?
Strongly related: the Ebborians
Imagine mapping my brain into two interpenetrating networks. For each brain cell, half of it goes to one map and half to the other. For each connection between cells, half of each connection goes to one map and half to the other. We can call these two mapped out halves Manfred One and Manfred Two. Because neurons are classical, as I think, both of these maps change together. They contain the full pattern of my thoughts. (This situation is even more clear in the Ebborians, who can literally split down the middle.)
So how many people am I? Are Manfred One and Manfred Two both people? Of course, once we have two, why stop there - are there thousands of Manfreds in here, with "me" as only one of them? Put like that it sounds a little overwrought - what's really going on here is the question of what physical system corresponds to "I" in english statements like "I wake up." This may matter.
The impact on anthropic probabilities is somewhat straightforward. With everyday definitions of "I wake up," I wake up just once per day no matter how big my head is. But if the "I" in that sentence is some constant-size physical pattern, then "I wake up" is an event that happens more times if my head is bigger. And so using the variable people-number definition, I expect to wake up with a gigantic head.
The impact on decisions is less big. If I'm in this head with a bunch of other Manfreds, we're all on the same page - it's a non-anthropic problem of coordinated decision-making. For example, if I were to make any monetary bets about my head size, and then donate profits to charity, no matter what definition I'm using, I should bet as if my head size didn't affect anthropic probabilities. So to some extent the real point of this effect is that it is a way anthropic probabilities can be ill-defined. On the other hand, what about preferences that depend directly on person-numbers like how to value people with different head sizes? Or for vegetarians, should we care more about cows than chickens, because each cow is more animals than a chicken is?
According to my common sense, it seems like my body has just one person in it. Why does my common sense think that? I think there are two answers, one unhelpful and one helpful.
The first answer is evolution. Having kids is an action that's independent of what physical system we identify with "I," and so my ancestors never found modeling their bodies as being multiple people useful.
The second answer is causality. Manfred One and Manfred Two are causally distinct from two copies of me in separate bodies but the same input/output. If a difference between the two separated copies arose somehow, (reminiscent of Dennett's factual account) henceforth the two bodies would do and say different things and have different brain states. But if some difference arises between Manfred One and Manfred Two, it is erased by diffusion.
Which is to say, the map that is Manfred One is statically the same pattern as my whole brain, but it's causally different. So is "I" the pattern, or is "I" the causal system?
In this sort of situation I am happy to stick with common sense, and thus when I say me, I think the causal system is referring to the causal system. But I'm not very sure.
Going back to the Ebborians, one interesting thing about that post is the conflict between common sense and common sense - it seems like common sense that each Ebborian is equally much one person, but it also seems like common sense that if you looked at an Ebborian dividing, there doesn't seem to be a moment where the amount of subjective experience should change, and so amount of subjective experience should be proportional to thickness. But as it is said, just because there are two opposing ideas doesn't mean one of them is right.
On the questions of subjective experience raised in that post, I think this mostly gets cleared up by precise description an anthropic narrowness. I'm unsure of the relative sizes of this margin and the proof, but the sketch is to replace a mysterious "subjective experience" that spans copies with individual experiences of people who are using a TDT-like theory to choose so that they individually achieve good outcomes given their existence.
Meetup : Urbana-Champaign: Finishing Up
Discussion article for the meetup : Urbana-Champaign: Finishing Up
This is likely the last meetup of the semester. So let's try and look back on the past and collect some reminders of things that might be useful.
Discussion article for the meetup : Urbana-Champaign: Finishing Up
Meetup : Urbana-Champaign: Seeking advice
Discussion article for the meetup : Urbana-Champaign: Seeking advice
Sometimes, one might even say usually, other people know more than me. This is great, because then they can give me advice.
Let's talk about how to evaluate advice resources, and how to ask, and reinforce how useful this can be.
Discussion article for the meetup : Urbana-Champaign: Seeking advice
View more: Next
Subscribe to RSS Feed
= f037147d6e6c911a85753b9abdedda8d)