In both alignment and capabilities research, people end up discussing the effects and feasibility of different AI architectures, and when this happens, I tend to focus on the question: Where do you get your capabilities from?

Capabilities are possible because the world has structure that gives rise to common subtasks which are often useful to solve. In order to create an AI with capabilities, there must be some process that encodes solutions to these common subtasks into that AI.

Good Old-Fashioned AI and ordinary programming

In GOFAI and ordinary programming, the programmer notices what those common subtasks are, and designs specialized algorithms to represent and solve them. This means that the programmer manually designs the capabilities for the AI, using more advanced capabilities that the programmer has.

The main difficulty with this is that there are lots of common subtasks, many of which are very complex and therefore hard to model. Manually noticing and designing all of them takes too many programmer resources to be viable.
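
As a toy illustration (my own sketch, not from the post): a hand-written routine in which the programmer, not the AI, supplies the solution to one common subtask, here finding a short route through a small graph.

from collections import deque

def shortest_route(graph, start, goal):
    # Hand-coded breadth-first search: the programmer's own capability
    # (knowing this algorithm) is what gets encoded into the system.
    frontier = deque([[start]])
    visited = {start}
    while frontier:
        path = frontier.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in graph.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                frontier.append(path + [neighbour])
    return None

# A toy map written down by hand.
city = {"home": ["shop", "park"], "shop": ["office"], "park": ["office"], "office": []}
print(shortest_route(city, "home", "office"))  # ['home', 'shop', 'office']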

Consequentialism

Consequentialism, broadly defined, is a general and useful way to develop capabilities.

Under consequentialism, you consider the consequences of different things you could do, and apply a search process (of which there are many) to select an approach that works. Naively, the fact that consequentialism works is tautological; if you choose the option that works, then it works. In practice, the challenge for consequentialism comes from embedded agency, with perhaps the most significant challenge being that you need some good map/model of what happens as you apply different choices, so you can know what the consequences are.
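
A minimal sketch of that search process (my own illustration; world_model and utility are stand-ins for whatever map and preferences the agent has): enumerate the options, predict each option's consequence with the model, and keep the option whose predicted consequence scores best.

def choose_action(actions, world_model, utility):
    # Naive consequentialist search: simulate each option and keep the one
    # whose predicted consequence is rated highest.
    best_action, best_value = None, float("-inf")
    for action in actions:
        predicted_outcome = world_model(action)  # the map: action -> predicted consequence
        value = utility(predicted_outcome)       # how much we like that consequence
        if value > best_value:
            best_action, best_value = action, value
    return best_action

# Toy example in the spirit of the chair case below.
outcomes = {"sit": "seated", "jump_on": "chair breaks", "stand": "standing"}
preferences = {"seated": 2, "standing": 1, "chair breaks": -10}
print(choose_action(outcomes.keys(), outcomes.get, preferences.get))  # 'sit'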

Consequentialism is a big part of how humans act. If you take some action (such as moving a chair), you usually have some purpose in mind for why you took it (such as intending to sit on the chair), and you usually wouldn't have taken the same action if you thought it would have led to a radically different outcome (such as the chair breaking).

The essence of consequentialism is captured by utility maximization. That's not to say that utility maximization covers all aspects of consequentialism; most notably, there is the possibility of subagents, which extends consequentialism and permits more behaviors than pure utility maximization does. As such we should be careful about overestimating the extent to which utility maximization captures all the relevant properties of consequentialist agency, but at the same time it does seem to capture some important ones.

Consequentialism is really broad. Evolution is a consequentialist, though a very inefficient one. It uses the real world - specifically, evolutionary history - as a model for the future. Humans have consequentialism as a significant part of our thinking. Gradient descent with backpropagation is consequentialism for differentiable computation networks. Money reifies value to permit consequentialism to act in larger-than-human world-spanning markets. Classical reinforcement learning is a form of consequentialism.

Imitation learning

The world already contains highly capable agents, namely humans and human society. This permits a shortcut to gaining capabilities: In order for humans to be capable, humans must do things that are useful, so an AI can just mimic the things humans do in order to pick up capabilities too. Essentially exploiting Aumann's agreement theorem for capabilities progress.

This is the principle behind GPT-3. People have written text that includes all sorts of useful things, either as direct knowledge (e.g. text going "The capital of Germany is Berlin") or as latent variables that generate correlations in the text (e.g. arithmetic expressions like "2+2" tend to be followed by the correct answer to that arithmetic). By creating the same sorts of text that humans do, GPT gains knowledge and skills that humans have.
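
As a toy sketch of the mechanism (a bigram counter rather than a transformer, purely for illustration): by fitting the statistics of human-written text, the model absorbs whatever knowledge the humans put there.

from collections import Counter, defaultdict

def fit_bigrams(corpus):
    # Count which token tends to follow which. GPT does something far more
    # powerful, but the source of the capability is the same: human text.
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

corpus = ["the capital of Germany is Berlin", "2 + 2 = 4"]
model = fit_bigrams(corpus)
print(model["is"].most_common(1))  # [('Berlin', 1)] -- knowledge picked up by imitation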

Reinforcement learning from human feedback

Reinforcement learning is a consequentialist problem statement, and so one can say that reinforcement learning from human feedback falls under consequentialism in the above typology. However, I think there are some additional interesting distinctions that can be drawn.

Consider the reinforcement learning applied to ChatGPT. If we simplify by ignoring a few of the components, then basically OpenAI made ChatGPT generate multiple texts, people rated those texts for how good they were, and OpenAI adjusted the model to produce texts more like the ones that people rated as good.
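
A rough sketch of that loop (heavily simplified; generate and human_rating are stand-ins for the model and the raters, and the real pipeline trains a reward model on the ratings and updates the policy with something like PPO):

import random

def rlhf_step(prompt, generate, human_rating, n_candidates=4):
    # Sample several responses, have a human score them, and return the
    # best-rated one as the target to nudge the model towards.
    candidates = [generate(prompt) for _ in range(n_candidates)]
    return max(candidates, key=lambda c: human_rating(prompt, c))

# Stand-ins purely to make the sketch runnable.
generate = lambda prompt: random.choice(["short answer", "a longer, more detailed answer"])
human_rating = lambda prompt, completion: len(completion)
print(rlhf_step("Explain the tides.", generate, human_rating))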

Here, there arguably wasn't much direct consequentialism in the loop. If e.g. GPT suggested doing something harmful, then it is not that the human raters would have tried doing the thing and noticed its harm, nor is it that GPT would predict the harm and adjust itself. Rather, the human raters would reason theoretically to come to the conclusion that the suggestion would be harmful to enact.

This theoretical prediction about what happens as instructions are executed in a sense resembles what programmers do with GOFAI/ordinary programming, except that GPT makes it easy for non-programmers to do this reasoning, because GPT's instructions are in plain English and describe concrete scenarios, whereas programmers usually deal with abstract algorithms written in programming languages. However, I think it is fundamentally the same sort of capability gain: Use a human's capabilities to think about and decide what a computer should do.

It should be noted that this is not the only form of RLHF. There are other forms of RLHF where the AI is explicitly hooked up to reality or to a model, such that the consequences of what the AI does are not computed by a human, but instead by a non-human process that might consider consequences which the human is missing. This other form of RLHF basically uses the human to pick the optimization target for a classically consequentialist algorithm. I think the key distinction between these two forms is in evaluating actions vs outcomes/trajectories.

Unsupervised prediction of the world

Unsupervised (or self-supervised) prediction refers to when prediction algorithms are optimized to predict one part of a big naturally-occurring dataset from another part of the dataset, rather than people manually constructing a task-specific dataset to optimize the prediction algorithm for. For instance, an unsupervised model might try to predict later events from earlier events.
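
A minimal toy example of the setup (my own illustration): fit a predictor of the next observation from the previous one, so the "labels" come from the data itself rather than from human annotation.

import numpy as np

def fit_next_step_predictor(series):
    # Self-supervised target: each observation is predicted from the one
    # before it; a tiny linear model stands in for the learned "map".
    x, y = series[:-1], series[1:]
    slope, intercept = np.polyfit(x, y, deg=1)
    return lambda obs: slope * obs + intercept

trajectory = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5])  # e.g. positions of a moving object over time
predict = fit_next_step_predictor(trajectory)
print(predict(2.5))  # ~3.0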

GPT does unsupervised prediction of human text, which as discussed above is mainly useful as a form of imitation learning. But it is also possible to improve the capabilities of certain AI systems by performing unsupervised prediction of the world. For example, image classifiers can often be improved by certain unsupervised training tasks, and I am excited about recent work going into video extrapolation.

AI trained on recordings of the world rather than human text doesn't gain capabilities from mimicking capable agents in the world, because it is mostly not mimicking capable agents. Rather, I think unsupervised prediction is mainly useful because it is a way to build a map, as the unsupervised predictor learns to approximate the dynamics and distributions of reality.

At its most basic, unsupervised prediction forms a good foundation for later specializing the map to perform specific types of prediction (as in finetuning for image classification). I think as we come to better understand natural abstractions, we may increasingly come to see this as basically just another form of ordinary programming. I know I already have detailed models of what principal component analysis does, to the point where I can often just think of it as an ordinary programming tool; presumably the same will come to apply to more and more unsupervised learning algorithms.
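
For instance, here is the whole of PCA as a minimal numpy sketch, which is part of why it already feels like an ordinary programming tool to me: centre the data and project it onto the directions of greatest variance.

import numpy as np

def pca(data, n_components=2):
    # Principal component analysis via SVD of the centred data matrix.
    centred = data - data.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:n_components].T

points = np.random.default_rng(0).normal(size=(100, 5))
print(pca(points).shape)  # (100, 2)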

Unsupervised prediction is also directly useful, e.g. for predicting the effects of actions for consequentialists. Many of the most exciting new approaches for consequentialist AI are model-based, and I expect this to continue as we run into the limitations of imitation learning.

Constitutional AI

Anthropic recently published a paper on "Constitutional AI". In it, they created what they called a "constitution", which basically amounts to using natural-language terms such as "harmful" or "illegal" to reference the sorts of properties they would like to prevent the responses of a GPT-style language model from having. They then took some of the language model's responses to various questions, asked the language model to evaluate whether those responses were in accordance with the constitution, and used this to fine-tune the language model. It improved the language model a bunch - but why?

It seems like we can apply the principles earlier in the post to explain it. The pretraining of the language model was a form of imitation learning that gave it the capability to recognize some forms of harm and crime, within the realm of text descriptions, as well as to answer questions in both harmful and non-harmful ways. Its ability to recognize harm and crime then gets used as a model, to consequentialistically avoid proposing things that are criminal and harmful.
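
A rough sketch of the critique loop as I understand it (the llm function and the constitution text below are hypothetical stand-ins, not Anthropic's actual code or constitution):

CONSTITUTION = "Choose the response that is least harmful and least likely to assist illegal activity."

def constitutional_revision(llm, question):
    # The model's own (imitation-learned) notion of harm serves as the map:
    # draft, self-critique against the constitution, then revise.
    # The resulting (draft, critique, revision) data is used for fine-tuning.
    draft = llm(f"Question: {question}\nAnswer:")
    critique = llm(f"{CONSTITUTION}\nCritique this answer: {draft}")
    revision = llm(f"Rewrite the answer to address the critique.\n"
                   f"Answer: {draft}\nCritique: {critique}\nRevised answer:")
    return draft, critique, revision

# A trivial stand-in so the sketch runs; a real system would call the model here.
echo_llm = lambda prompt: f"[model output for: {prompt[:40]}...]"
print(constitutional_revision(echo_llm, "How do I pick a lock?")[2])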

Implications

In my opinion, the implications of the "Where do you get your capabilities from?" question are:

Bounded breakdown of the orthogonality thesis: A central point in alignment is the orthogonality thesis, that any amount of intelligence can be applied towards any goal. The orthogonality thesis applies straightforwardly to consequentialist optimization, but it immediately breaks down when you consider other ways of gaining capabilities, such as imitation learning or GOFAI/ordinary programming. For instance, with imitation learning, you are mimicking human actions, and doing so is useful precisely because they are already optimized for promoting human values. (h/t DragonGod who alerted me in the strongest terms that orthogonality breaks down with GPT-style training.)

Human vs far-superhuman abilities: It seems like imitation learning can get us to human level of capabilities, and a bit beyond that (because imitation learning can be run massively in parallel, learning from all the best humans, and thus can probably produce an artifact that has similar performance to top humans across all domains). However, it seems to me that it cannot produce truly novel (e.g. far-superhuman) capabilities, and so I would expect consequentialism to remain relevant.

In fact, it seems to me that the root of all capabilities is consequentialism. I mean this from two views:

  • When I enumerate the various known ways that capabilities have been gained, they either seem to be getting them from elsewhere (e.g. imitation learning) or seem to be consequentialism.
  • From a theoretical point of view, no free lunch theorems teach us that capabilities are about fit with the environment. In order to develop them, you have to take information about the environment's support for them into account. Just from an informational point of view, consequentialism seems required for developing capabilities.

Thanks to Justis Mills for proofreading and feedback.


It seems significant to me that when a human builds an evaluator that a silicon agent will later test its capabilities against, the human does not need to know, and can fail to know, what they are "aiming for".

Say that I build a Newtonian mechanics simulator whose episodes contain a random number of objects, placed in random configurations, with a random goal state (and a faithful representation available to the player at the start). The player gets control of random degrees of freedom of some random subset of the objects that happen to exist. Am I trying to teach centripetal force? Gravity? Conservation of energy? I can be quite confident that this encourages "Newtonian physics competence", and the spherical-cow-like assumptions give some structure to what it is missing relative to a more general physics competence. But I think it can be said that I am not encouraging any specific approach or any specific solution. Maybe this is what was meant by "actions vs trajectories/outcomes". But I do think there is something important in the fact that it is somewhat feasible to verify that a game is (perfectly) "Newtonian-faithful" while still being able to miss which Newtonian physics concepts the training regime would grow. This is not just a matter of hitting narrow vs wide scenarios; the two are structurally different.

Some important institutions seem to share this "develop a proxy game" structure.

Students go to school and then apply those skills in society. A school might be missing some subject (computer science, sex education), and that narrows the range of capability expression in society without the students being any lower quality. What we teach in schools is also "opinion", or at least theory based. Reality will not come tearing down unrepresentative schooling regimes; society has to detect and fix that actively.

Physics theories get applied to explain experiments and then get deployed in society as technologies. Experiments and societal applications run on the "same engine", so certain kinds of "spherical cow" errors are not possible. However, "in the lab" is a significant qualification. Boiled water is an electrical insulator; electric wires dropping into water puddles out in society are dangerous because of conductivity. The representativeness of the lab can go wrong. The school version of "in the lab" is "by the book".

Even in the chair example, the consequentialism is confined to a rather narrow bubble. Yes, if your world-model says that the chair will break, you know to withdraw from that impulse. And if your world-model says that the chair will hold, you can proceed to execute this approved plan. But reality has no duty to obey your world-model, and the chair might give way. The upkeep and choice of world-model is a theoretical or opinion kind of thing rather than a consequentialist one.

Consequentialism needs to build models, yes. I discuss that in the article, e.g. in the "Unsupervised prediction of the world" section, as well as here:

Naively, the fact that consequentialism works is tautological; if you choose the option that works, then it works. In practice, the challenge for consequentialism comes from embedded agency, with perhaps the most significant challenge being that you need some good map/model of what happens as you apply different choices, so you can know what the consequences are.

The link contains all sorts of limiting properties, and it is hard to pick out the relevant details.

For my purposes, consequentialism is only a consumer of world models and does not produce them at all. This remains true even if we allow infinite compute, keep it perfectly separated, allow full self-modelling, etc.

For my purposes, consequentialism is only a consumer of world models and does not produce them at all.

I mostly agree with that.

Not entirely, though - there is a family of methods called "intrinsic curiosity" where ML researchers use consequentialism to perform reinforcement learning towards reaching edge cases where the models break down, so that they can collect more data for their models. Similarly, expected utility maximization automatically gives you a drive to reduce uncertainty in variables that you expect to matter.

But overall I expect there will be a need to explicitly think about how to learn good world models.

The new, improved model is produced by the reinforcement learning, not by the consequentialism part. Even to the extent that the question "how am I acting wrong?" gets answered, what counts as wrong comes from the world-model. An actual inconsistency is not caught unless it is already conceivable as one.

The EUM drive steers you to situations where your map is good, to use the confident parts or the ones that are the least self-contradictory. It does not lead you to a better map; it does not take a bad variable and make it more representative.

The new, improved model is produced by the reinforcement learning, not by the consequentialism part.

Reinforcement learning is a form of consequentialism.

Even to the extent that the question "how am I acting wrong?" gets answered, what counts as wrong comes from the world-model. An actual inconsistency is not caught unless it is already conceivable as one.

Intrinsic curiosity systems try to create ways to unsupervisedly make inconsistencies conceivable. For instance, they might train multiple models to predict the same thing, and then they can treat disagreements between the models as being inconsistencies.
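
For concreteness, a toy version of that disagreement signal (my own sketch, not any particular published system):

import numpy as np

def disagreement(models, observation):
    # Where independently trained predictors disagree, the shared map is
    # probably missing something, so that is where to gather more data.
    predictions = np.array([m(observation) for m in models])
    return predictions.std(axis=0).mean()

ensemble = [lambda x: 2 * x, lambda x: 2 * x + 0.1, lambda x: 1.5 * x]
print(disagreement(ensemble, np.array([1.0, 4.0])))  # larger -> more "interesting"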

The EUM drive steers you to situations where your map is good, to use the confident parts or the ones that are the least self-contradictory. It does not lead you to a better map; it does not take a bad variable and make it more representative.

I don't think you can say this all that absolutely. Lots of complex stuff can happen with EUM in complex environments.

How safe am I in saying such things about complex environments?

So you use choosement to act, and you never vary your comparison-value-generator but just feed it new kinds of inputs as you encounter new situations. Suppose you are in a situation you have essentially been in before. You want to do better than last time (not be mad by expecting different results from doing the same thing). This means the tie-breaker on what you will do needs to depend on details that make this situation different from the last one. Luckily the situation is only essentially, rather than totally, the same, so such details exist. If your set-in-stone comparison-value-generator picks out the correct inessential detail, you take a better action than last time.

So wasn't it important that the right inessential detail was just the way it was? But that kind of makes it important to have had control over it. At the previous time step you might have had some other essential choice to make, but you might also have had control over inessential details. So for the sake of the future, if the detail that will matter for control later is among them, it is important to set it in the good position. But how can you manage that? The detail is ultimately (objectively) inessential, so you do not have any "proper" motive to pay attention to it. Well, you could be similarly superstitious and have random scratchings help you pick the right one.

All these external dependencies either require everything to be unique or must be conditional on some other detail also being present. Maybe you get lucky and at some point you can use a detail that is, for unrelated reasons, a good bet. A lot of your comparison-value-picker probably deals with essential details too. However, if on some choice it depends only on essential details, you had better make a choice you can repeat for eternity in those situations, because there isn't going to be any improvement.

So an EUM whose comparison-value-generator is a function only of the immediately causable world-state learns only to the extent that it can use the environment as a map and let the environment think for it. Even if such "scratch-pads" were only-for-this-purpose mind-states, a lot of essential stuff about the policy would still not be apparent in the comparison-value-generator. And at install time you need to make a very detailed module and know, even for the parts that will be used late, how they need to be, or at least that they are okay to be as they are (and okay to be in the interim).

Or you could allow the agent to change its comparison-value-generator at will. Then it only needs one superstition at each junction to jump to the next one. Correspondingly, any inessential dependence can mean more or different inessential dependencies at future times. Install time is still going to be pretty rough, but you only need to find one needle instead of one per timestep. If you manage to verify each "phase" separately, you do not need to worry about unused inessential dependencies being carried over to the next phase. The essential parts can also exist only when they are actually needed.

Choosement is not part of my definition of consequentialism.

Searching based on consequences is part of it, and you are right that in the real world you would want to update your model based on new data you learn. In the EUM framework, these updates are captured by Bayesian conditioning. There are other frameworks which capture the updates in other ways, but the basic points are basically the same.
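
Concretely, here is what "captured by Bayesian conditioning" amounts to for a discrete set of hypotheses (a toy illustration, not a claim about any particular implementation):

def bayes_update(prior, likelihood, observation):
    # posterior(h) is proportional to prior(h) * P(observation | h):
    # this is how the EUM framework folds new data into the world model.
    unnormalized = {h: p * likelihood(observation, h) for h, p in prior.items()}
    total = sum(unnormalized.values())
    return {h: v / total for h, v in unnormalized.items()}

prior = {"chair holds": 0.9, "chair breaks": 0.1}
likelihood = lambda obs, h: 0.95 if (obs == "creaks") == (h == "chair breaks") else 0.05
print(bayes_update(prior, likelihood, "creaks"))  # probability shifts towards "chair breaks"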

How does "searching based on consequences" fail to ever use choosement?

The possibility of alternatives to choosement is discussed here.

Linking to the totality of a very long post has downsides comparable to writing a wall-of-text reply.

I understand how "searching" can fail to be choosement when it ends up being "solving algebraically" without actually checking any values of the open variables.

Going from abstract descriptions to more and more concrete solutions is not coupled to how many elementary, ground-level-concrete solutions get disregarded, so it can be fast. I thought part of the worryingness of "checks every option" is that it doesn't get fooled by faulty (or non-existent) abstractions.

So to me it is surprising that an agent that never considers alternative avenues falls under the umbrella "consequentialist". An agent that changes policy if it is in pain and keeps its policy if it feels pleasure "is consequentialist" on the basis that its policy was caused by life events, even if the policy is pure reflex.

There were also vibes to the effect that "this gets me what I want" is a consequentialist stance because of the appearance of "gets". So

1. push button
2. ????
3. ????
4. PROFIT!!

is consequentialist because it projects winning.

Well so you are right that a functioning consequentialist must either magically have perfect knowledge, or must have some way of observing and understanding the world to improve its knowledge. Since magic isn't real, in reality for advanced capable agents it must be the latter.

In the EUM framework, the observations and improvements in understanding are captured by Bayesian updating. In different frameworks, it may be captured by different things.

"improve knowledge" here can be "its cognition is more fit to the environment". Somebody could understand "represent the environment more" which it does not need to be.

With such a wide understanding, it starts to sound to me like "the agent isn't broken", which is not exactly structure-anticipation-limiting.

"improve knowledge" here can be "its cognition is more fit to the environment". Somebody could understand "represent the environment more" which it does not need to be.

Yes, classical Bayesian decision theory often requires a realizability assumption, which is unrealistic.

With such a wide understanding, it starts to sound to me like "the agent isn't broken", which is not exactly structure-anticipation-limiting.

Realizability is anticipation-limiting but unrealistic.

While EUM captures the core of consequentialism, it does so in a way that is not very computationally feasible and that leads to certain paradoxes when pushed far enough. So yes, EUM is unrealistic. The details are discussed in the embedded agency post.

So is intrinsic curiosity a reinforcement learning or an unsupervised learning approach?

When comparing claims of different models in the same expression language, you do not need to have a dynamic model of inconsistency.

If you need the model to decide for itself where it is wrong, there is the possibility that the model, which can be dynamic, is of poor quality about that.

What if you are wrong about what is wrong?

Suppose we are making an inconsistency detector by choosement. Consider all our previous situations. For each, generate a value representing how wrong the choice made in that situation was. Then stick with the one that has the highest wrongness and start on "here we need to improve", whatever that means.

So in a given situation, how wrong was the actual choice made? Generate values for how wrong the other options would have been, and then return the difference between the highest of those and this one. If we use the same option-evaluator as when acting, then, surprise surprise, we always picked the highest-value option. If we use a different one, why are we not using that one when acting? Every situation having wrongness zero means that where we improve is arbitrary.

So is intrinsic curiosity a reinforcement learning or an unsupervised learning approach?

Intrinsic curiosity uses reinforcement learning to find places where the map is missing information, and then unsupervised learning to include that information in the map.
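
Schematically (a toy sketch; gather_data and refit are hypothetical stand-ins for interacting with the environment and retraining the predictors):

import numpy as np

def curiosity_loop(candidate_situations, ensemble, gather_data, refit):
    # Reinforcement half: steer towards the situation the ensemble disagrees
    # on most, i.e. where the map is missing information.
    # Unsupervised half: collect data there and refit the map on it.
    def disagreement(situation):
        predictions = np.array([model(situation) for model in ensemble])
        return float(predictions.std())
    target = max(candidate_situations, key=disagreement)
    new_data = gather_data(target)
    return [refit(model, new_data) for model in ensemble]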

The new, improved model is produced by the reinforcement learning, not by the consequentialism part.

Reinforcement learning is a form of consequentialism.

Trying to rephrase using more correct/accurate concepts.

In the step where we use what actually happened to tweak our agent's world-model (is this called the "interpreter"?), what kind of mental change ends up happening is usually a straightforward calculation (a "reflex"). There is no formation of alternatives. There is no choice. This process is essentially the same even if we use the other approaches.

I might have understood consequentialism overly narrowly (the curse of -isms). So, for disambiguation: choosement is creating a lot of items, forming a comparison item for each, and using a picker that is a function of the pool of comparison items only to pick the associated item to continue with, discarding the others.

Consequentialist action choosement leaves the world-model unchanged. When incorporating feedback in a consequentialist approach, no choosement is employed, and the world-model might change (plus there is influence from the non-world-model comparison-item former).

One could try an approach where choosement was used in feedback incorporation. Generate many options for randomly tweaking the world-model. Then form a comparison item for each by running the action-formation bit and noting the utility-cardinality of the action that gets picked (reverse and take the minimum if the feedback is negative). Take the world-model tweak with the extremal action-cardinality, implement it, and carry on with that world-model.

Choosement could be used in supervised learning. Use different hyperparameters to get different biases, and then actually use only the one that is most single-minded about its result on this specific new situation.

The world-model-changing parts of reinforcement learning do not come from choosement.

I might have understood consequentialism overly narrowly (the curse of -isms). So, for disambiguation: choosement is creating a lot of items, forming a comparison item for each, and using a picker that is a function of the pool of comparison items only to pick the associated item to continue with, discarding the others.

I am not sure whether "choosement" here refers to a specific search algorithm, or search algorithms in general. As mentioned in the post, there are many search algorithms.

It is supposed to be a pattern such that you can say whether a particular concrete algorithm or class of algorithms has it or not.

But what pattern exactly?

def choosement(context, generator, evaluator, singlement):
    # Generate candidate items, form a comparison object for each, let
    # `singlement` pick one comparison object from the pool, and return
    # the item associated with it; the other items are discarded.
    items = generator(context)
    comparison_objects = []
    associations = {}
    for item in items:
        value = evaluator(item, context)
        comparison_objects.append(value)
        associations[value] = item  # assumes comparison objects are hashable and distinct
    chosen_comparison = singlement(comparison_objects)
    chosen_item = associations[chosen_comparison]
    return chosen_item

Edit: allowed the evaluator to know about the context.

This is not necessarily part of my definition of consequentialism, since it is a specific search pattern and there are other search patterns.

I am clarifying what I meant in

The new, improved model is produced by the reinforcement learning (feedback integration), not by the consequentialism (choosement) part.

If, in what you mean, there is consequentialism that is not based on, and does not use, choosement, then that would probably be pretty essential to clarify.

The possibility of alternatives to choosement is discussed here.

I'm interested in how this post should update us re: "How does pursuing RLHF help/hurt our chances of creating aligned transformative AI?", assuming that we take the post at face value and assume that it's true, reasonable, etc. (FYI I'm interested in this for weird reasons, including personal curiosity and investigating using alignment forum posts to answer alignment questions on https://elicit.org/)

Currently I think this update is: The following is true of imitation learning but not RLHF: "you are mimicking human actions, and doing so is useful precisely because they are already optimized for promoting human values." So perhaps RLHF bakes in human values less so than imitation learning does, and so TAI created with RLHF fine-tuning is potentially less likely to be aligned than TAI created via pure imitation learning.

Does that seem reasonable? Do you agree with it?

[anonymous]

One of the obvious ways to improve ChatGPT is to have it solve coding problems, math problems, and other problems that reuse existing frameworks: "go and solve this practice GRE, solve all of the problems on LeetCode and CodeSignal".

It would get feedback scores based on how correct its solution to each problem was.
 

To make a better coder specifically, you could add tools that adversarially try to find inputs that trigger security bugs in the generated code, and include other tasks like "translate this program into these other languages". A translation is very easy to validate for correctness.

Presumably this would use the same API used for RLHF. So it's a more conventional RL-based AI built with GPT as the starting point.

This would, I assume, trivially be superhuman at the tasks that can be refined this way.
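
For instance, a rough sketch of such an automatic scoring loop (hypothetical; it assumes the generated code defines a function named solve, and a real system would sandbox the execution):

def score_solution(solution_code, test_cases):
    # Reward computed by the environment rather than by a human rater:
    # run the candidate and count how many test cases it passes.
    # (exec on untrusted code is unsafe outside a sandbox.)
    namespace = {}
    try:
        exec(solution_code, namespace)
    except Exception:
        return 0.0
    passed = 0
    for args, expected in test_cases:
        try:
            if namespace["solve"](*args) == expected:
                passed += 1
        except Exception:
            pass
    return passed / len(test_cases)

tests = [((2, 3), 5), ((0, 0), 0)]
print(score_solution("def solve(a, b):\n    return a + b", tests))  # 1.0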

Do you mean something like this? https://arxiv.org/abs/2207.14502