David James

My top interest is AI safety, followed by reinforcement learning. My professional background is in software engineering, computer science, and machine learning. I have degrees in electrical engineering, liberal arts, and public policy. I currently live in the Washington, DC metro area; before that, I lived in Berkeley for about five years.


Comments


Note that this is different from the (also very interesting) question of what LLMs, or the transformer architecture, are capable of accomplishing in a single forward pass. Here we're talking about what they can do under typical auto-regressive conditions like chat.


I would appreciate it if the community here could point me to research that agrees or disagrees with my claim and conclusions below.

Claim: one pass through a transformer (of a given size) can only do a finite number of reasoning steps.

Therefore: If we want an agent that can plan over an unbounded number of steps (e.g. one that does tree-search), it will need some component that can do an arbitrary number of iterative or recursive steps.

Sub-claim: The above claim does not conflict with the Universal Approximation Theorem.
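As a toy illustration of the claim above (not a model of a real transformer), the sketch below treats one forward pass as a fixed number of applications of a step function, and compares it with an outer loop that feeds the output back in. The names `collatz_step`, `single_pass`, and `run_with_feedback`, and the choice of the Collatz update as the "reasoning step", are all assumptions made for illustration.

```python
from typing import Callable, Tuple


def collatz_step(n: int) -> int:
    """One 'reasoning step': a Collatz update that stays at 1 once it arrives."""
    if n == 1:
        return 1
    return n // 2 if n % 2 == 0 else 3 * n + 1


def single_pass(x: int, step: Callable[[int], int], depth: int) -> int:
    """Stand-in for one forward pass: exactly `depth` steps, never more."""
    for _ in range(depth):
        x = step(x)
    return x


def run_with_feedback(
    x: int,
    step: Callable[[int], int],
    depth: int,
    done: Callable[[int], bool],
    max_passes: int = 10_000,
) -> Tuple[int, int]:
    """Outer loop: repeat fixed-depth passes until `done` holds (or we give up)."""
    passes = 0
    while not done(x) and passes < max_passes:
        x = single_pass(x, step, depth)
        passes += 1
    return x, passes


DEPTH = 4  # hypothetical number of steps available in one pass
one_shot = single_pass(27, collatz_step, DEPTH)
final, passes = run_with_feedback(27, collatz_step, DEPTH, done=lambda n: n == 1)
print(one_shot, final, passes)
```

With a fixed `DEPTH`, the single pass cannot finish a computation that needs more steps than that; the outer loop can, because the number of passes is not fixed in advance.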

Here is an example of a systems dynamics diagram showing some of the key feedback loops I see. We could discuss various narratives around it and what to change (add, subtract, modify).

┌───── to the degree it is perceived as unsafe ◀──────────┐                   
│          ┌──── economic factors ◀─────────┐             │                   
│        + ▼                                │             │                   
│      ┌───────┐     ┌───────────┐          │             │         ┌────────┐
│      │people │     │ effort to │      ┌───────┐    ┌─────────┐    │   AI   │
▼   -  │working│   + │make AI as │    + │  AI   │  + │potential│  + │becomes │
├─────▶│  in   │────▶│powerful as│─────▶│ power │───▶│   for   │───▶│  too   │
│      │general│     │ possible  │      └───────┘    │unsafe AI│    │powerful│
│      │  AI   │     └───────────┘          │        └─────────┘    └────────┘
│      └───────┘                            │                                 
│          │ net movement                   │ e.g. use AI to reason
│        + ▼                                │      about AI safety
│     ┌────────┐                          + ▼                                 
│     │ people │     ┌────────┐      ┌─────────────┐              ┌──────────┐
│   + │working │   + │ effort │    + │understanding│            + │alignment │
└────▶│ in AI  │────▶│for safe│─────▶│of AI safety │─────────────▶│  solved  │
      │ safety │     │   AI   │      └─────────────┘              └──────────┘
      └────────┘     └────────┘             │                                 
         + ▲                                │                                 
           └─── success begets interest ◀───┘

I find this style of thinking particularly constructive.

  • For any two nodes, you can see a visual relationship (or lack thereof) and ask "what influence do these have on each other and why?".
  • The act of summarization cuts out chaff.
  • It is harder to fool yourself about the completeness of your analysis.
  • It is easier to get to core areas of confusion or disagreement with others.

Personally, I find verbal reasoning workable for "local" (pairwise) reasoning but quite constraining for systemic thinking.

If nothing else, I hope this example shows how easily key feedback loops get overlooked. How many of us claim to have... (a) some technical expertise in positive and negative feedback? (b) interest in Bayes nets? So why don't we take the time to write out our diagrams? How can we do better?

P.S. There are major oversights in the diagram above, such as economic factors. This is not a limitation of the technique itself -- it is a limitation of the space and effort I've put into it. I have many other such diagrams in the works.
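One way to make the loops in such a diagram explicit is to encode it as a signed directed graph and enumerate its cycles. The sketch below is only one reading of the ASCII diagram above: the short node names and the exact edge list are assumptions, and a different reading of the arrows would give different loops. A loop is labeled reinforcing when the product of its edge signs is positive and balancing when it is negative.

```python
from math import prod

# (source, target): sign. One possible transcription of the diagram.
EDGES = {
    ("general_ai_people", "capability_effort"): +1,
    ("capability_effort", "ai_power"): +1,
    ("ai_power", "unsafe_potential"): +1,
    ("unsafe_potential", "too_powerful"): +1,
    ("ai_power", "general_ai_people"): +1,          # economic factors
    ("unsafe_potential", "general_ai_people"): -1,  # perceived as unsafe
    ("unsafe_potential", "safety_people"): +1,      # perceived as unsafe
    ("general_ai_people", "safety_people"): +1,     # net movement
    ("safety_people", "safety_effort"): +1,
    ("safety_effort", "safety_understanding"): +1,
    ("safety_understanding", "alignment_solved"): +1,
    ("ai_power", "safety_understanding"): +1,       # use AI to reason about safety
    ("safety_understanding", "safety_people"): +1,  # success begets interest
}


def simple_cycles(edges):
    """Return each simple cycle once, listed from its alphabetically smallest node."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
    nodes = sorted({n for edge in edges for n in edge})
    cycles = []

    def dfs(start, node, path):
        for nxt in graph.get(node, []):
            if nxt == start:
                cycles.append(path)
            elif nxt > start and nxt not in path:
                dfs(start, nxt, path + [nxt])

    for start in nodes:
        dfs(start, start, [start])
    return cycles


def loop_polarity(cycle, edges):
    """Multiply edge signs around the cycle: positive means reinforcing."""
    pairs = zip(cycle, cycle[1:] + cycle[:1])
    return "reinforcing" if prod(edges[p] for p in pairs) > 0 else "balancing"


cycles = simple_cycles(EDGES)
for cycle in cycles:
    print(loop_polarity(cycle, EDGES), " -> ".join(cycle))
```

Writing the edges down this way forces a decision about every arrow and sign, which is where disagreements about the diagram tend to surface.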

I like having a list of small, useful things to do that tend to pay off in the long run, like:

  • go to the grocery store to make sure you have fresh fruits and vegetables
  • meditate for 10 minutes
  • do pushups and sit ups
  • journal for 10 minutes

When my brain feels cluttered, it is nice to have a list of time-boxed simple tasks that don’t require planning or assessment.

Verify human designs and automatically create AI-generated designs which provably cannot be opened by mechanical picking.

Such a proof would be subject to its definition of "mechanical picking" and a sufficiently accurate physics model. (For example, would an electronically-controllable key-looking object with adjustable key-cut depths with pressure sensors qualify as a "pick"?) 

I don't dispute the value of formal proofs for safety. If accomplished, they move the conversation to "is the proof correct?" and "are we proving the right thing?". Both are steps in the right direction, I think.

Thanks for the references; I'll need some time to review them. In the meantime, I'll make some quick responses.

As a side note, I'm not sure how tree search comes into play; in what way does tree search require unbounded steps that doesn't apply equally to linear search?

I intended tree search as just one example, since minimax tree search is a common example for game-based RL research.

No finite agent, recursive or otherwise, can plan over an unbounded number of steps in finite time...

In general, I agree. Though there are notable exceptions for cases such as (not mutually exclusive):

  • a closed-form solution is found (for example, where a time-based simulation can calculate some quantity at any arbitrary time step using the same amount of computation)

  • approximate solutions using a fixed number of computation steps are viable

  • a greedy algorithm can select the immediate next action that is equivalent to following a longer-term planning algorithm
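The closed-form case can be sketched with a toy example, assuming a quantity that grows by a fixed rate each time step. The functions `simulate` and `closed_form` and the growth model are illustrative assumptions; the point is only that the second function does roughly the same amount of work no matter how far ahead it looks.

```python
import math


def simulate(x0: float, rate: float, steps: int) -> float:
    """Step-by-step simulation: work grows linearly with `steps`."""
    x = x0
    for _ in range(steps):
        x *= 1.0 + rate
    return x


def closed_form(x0: float, rate: float, steps: int) -> float:
    """Closed-form answer: one exponentiation, regardless of `steps`."""
    return x0 * (1.0 + rate) ** steps


stepwise = simulate(100.0, 0.05, 1000)
direct = closed_form(100.0, 0.05, 1000)
print(math.isclose(stepwise, direct, rel_tol=1e-9))
```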

... so it's not immediately clear to me how iteration/recursion is fundamentally different in practice.

Yes, like I said above, I agree in general and see your point.

As I'm confident we both know, some algorithms can be written more compactly when recursion/iteration are available. I don't know how much computation theory touches on this; i.e. what classes of problems this applies to and why. I would make an intuitive guess that it is conceptually related to my point earlier about closed-form solutions.

Claim: the degree to which the future is hard to predict has no bearing on the outer alignment problem.

  • If one is a consequentialist (of some flavor), one can still construct a "desirability tree" over various possible future states. Sure, the uncertainty makes the problem more complex in practice, but the algorithm is still very simple. So I don't think that a more complex universe intrinsically has anything to do with alignment per se.
    • Arguably, machines will have better computational ability to reason over a vast number of future states. In this sense, they will be more ethical according to consequentialism, provided their valuation of terminal states is aligned.
    • To be clear, of course, alignment w.r.t. the valuation of terminal states is important. But I don't think this has anything to do with a harder to predict universe. All we do with consequentialism is evaluate a particular terminal state. The complexity of how we got there doesn't matter.
    • (If you are detecting that I have doubts about the goodness and practicality of consequentialism, you would be right, but I don't think this is central to the argument here.)
    • If humans don't really carry out consequentialism like we hope they would (and surely humans are not rational enough to adhere to consequentialist ethics -- perhaps not even in principle!), we can't blame this on outer alignment, can we? This would be better described as goal misspecification.
  • If one subscribes to deontological ethics, then the problem becomes even easier. Why? One wouldn't have to reason probabilistically over various future states at all. The goodness of an action only has to do with the nature of the action itself.
  • Do you want to discuss some other kind of ethics? Is there some other flavor that would operate differentially w.r.t. outer alignment in a more versus less predictable universe?
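To show how simple the algorithm over a "desirability tree" can be, here is a minimal sketch under stated assumptions: each action leads to a tree whose branches carry probabilities and whose leaves carry a desirability score. The `Node` class, the function names, and the numbers are hypothetical. The hard parts (getting the probabilities and the terminal valuations right) sit entirely in the inputs.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    value: float = 0.0  # desirability; only read at terminal states
    children: List[Tuple[float, "Node"]] = field(default_factory=list)  # (probability, child)


def expected_desirability(node: Node) -> float:
    """Probability-weighted desirability of the terminal states under `node`."""
    if not node.children:
        return node.value
    return sum(p * expected_desirability(child) for p, child in node.children)


def best_action(actions: Dict[str, Node]) -> str:
    """Pick the action whose subtree has the highest expected desirability."""
    return max(actions, key=lambda name: expected_desirability(actions[name]))


actions = {
    "risky": Node(children=[(0.5, Node(value=10.0)), (0.5, Node(value=0.0))]),
    "safe": Node(children=[(1.0, Node(value=4.0))]),
}
print(best_action(actions))
```

Note that only the leaf values enter the score; the path taken to reach a terminal state does not matter, which matches the point made in the list above.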

Want to try out a thought experiment? Put that same particular human (who wanted to specify goals for an agent) in the financial scenario you mention. Then ask: how well would they do? Compare the quality of how the person would act versus how well the agent might act.

This raises related questions:

  • If the human doesn't know what they would want, it doesn't seem fair to blame the problem on alignment failure. In such a case, the problem would be a person's lack of clarity.
  • Humans are notoriously good rationalizers and may downplay their own bad decisions. Making a fair comparison between "what the human would have done" versus "what the AI agent would have done" may be quite tricky. (See the Fundamental Attribution Error, a.k.a. correspondence bias.)

As I understand it, the argument above doesn't account for the agent using the best information available at the time (in the future, relative to its goal specification).

I think there is some confusion around a key point. For alignment, do we need to define what an agent will do in all future scenarios? It depends on what you mean.

  • In some sense, no, because in the future, the agent will have information we don't have now.
  • In some sense, yes, because we want to know (to some degree) how the agent will act with future (unknown) information. Put another way, we want to guarantee that certain properties hold about its actions.

Let's say we define an aligned agent as one doing what we would want, provided that we were in its shoes (i.e. knowing what it knew). Under this definition, it is indeed possible to specify an agent's decision rule in a way that doesn't rely on long-range predictions (where predictive power gets fuzzy, like Alejandro says, due to measurement error and complexity). See also the adjacent comment about a thermostat by eggsyntax.

Note: I'm saying "decision rule" intentionally, because even an individual human does not have a well-defined utility function. (edited)

Nevertheless, it seems wrong to say that my liver is optimising my bank balance, and more right to say that it "detoxifies various metabolites, synthesizes proteins, and produces biochemicals necessary for digestion"---even though that gives a less precise account of the liver's behaviour.

I'm not following why this is a less precise account of the liver's behavior.

I’m curious if your argument, distilled, is: fewer people skilled in technical AI work is better? Such a claim must be examined closely! Think of it from a systems dynamics point of view. We must look at more than just one relationship. (I personally try to press people to share some kind of model that isn’t presented only in words.)
