David James

My top interest is AI safety, followed by reinforcement learning. My professional background is in software engineering, computer science, and machine learning. I have degrees in electrical engineering, liberal arts, and public policy. I currently live in the Washington, DC metro area; before that, I lived in Berkeley for about five years.


Comments


Note that this is different from the (also very interesting) question of what LLMs, or the transformer architecture, are capable of accomplishing in a single forward pass. Here we're talking about what they can do under typical auto-regressive conditions like chat.


I would appreciate it if the community here could point me to research that agrees or disagrees with my claim and conclusions below.

Claim: one pass through a transformer (of a given size) can only perform a bounded number of sequential reasoning steps.

Therefore: If we want an agent that can plan over an unbounded number of steps (e.g. one that does tree-search), it will need some component that can do an arbitrary number of iterative or recursive steps.

Sub-claim: The above claim does not conflict with the Universal Approximation Theorem.
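To make the distinction concrete, here is a minimal, purely illustrative sketch (toy arithmetic stand-ins, not a real transformer): a fixed-depth forward pass performs a bounded number of sequential steps, and only some outer loop (autoregression, tree search, etc.) makes the number of steps unbounded.

```python
# Illustrative sketch only: `refine` and `done` are toy stand-ins, not model code.

def forward_pass(state, depth=12):
    """One pass through a fixed-depth network: at most `depth` sequential steps."""
    for _ in range(depth):        # bounded by the architecture, fixed in advance
        state = refine(state)     # stand-in for one layer of computation
    return state

def iterative_reasoner(state, done, max_iters=None):
    """Wrapping the bounded pass in a loop makes the total computation unbounded."""
    i = 0
    while not done(state) and (max_iters is None or i < max_iters):
        state = forward_pass(state)   # each call is bounded; the loop is not
        i += 1
    return state

# Toy stand-ins so the sketch runs end to end:
def refine(x):
    return x + 1

def done(x):
    return x >= 100

print(iterative_reasoner(0, done))   # crosses 100 after 9 passes of depth 12
```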

Here is an example of a systems dynamics diagram showing some of the key feedback loops I see. We could discuss various narratives around it and what to change (add, subtract, modify).

┌───── to the degree it is perceived as unsafe ◀──────────┐                   
│          ┌──── economic factors ◀─────────┐             │                   
│        + ▼                                │             │                   
│      ┌───────┐     ┌───────────┐          │             │         ┌────────┐
│      │people │     │ effort to │      ┌───────┐    ┌─────────┐    │   AI   │
▼   -  │working│   + │make AI as │    + │  AI   │  + │potential│  + │becomes │
├─────▶│  in   │────▶│powerful as│─────▶│ power │───▶│   for   │───▶│  too   │
│      │general│     │ possible  │      └───────┘    │unsafe AI│    │powerful│
│      │  AI   │     └───────────┘          │        └─────────┘    └────────┘
│      └───────┘                            │                                 
│          │ net movement                   │ e.g. use AI to reason
│        + ▼                                │      about AI safety
│     ┌────────┐                          + ▼                                 
│     │ people │     ┌────────┐      ┌─────────────┐              ┌──────────┐
│   + │working │   + │ effort │    + │understanding│            + │alignment │
└────▶│ in AI  │────▶│for safe│─────▶│of AI safety │─────────────▶│  solved  │
      │ safety │     │   AI   │      └─────────────┘              └──────────┘
      └────────┘     └────────┘             │                                 
         + ▲                                │                                 
           └─── success begets interest ◀───┘

I find this style of thinking particularly constructive.

  • For any two nodes, you can see a visual relationship (or lack thereof) and ask "what influence do these have on each other and why?".
  • The act of summarization cuts out chaff.
  • It is harder to fool yourself about the completeness of your analysis.
  • It is easier to get to core areas of confusion or disagreement with others.

Personally, I find verbal reasoning workable for "local" (pairwise) reasoning but quite constraining for systemic thinking.

If nothing else, I hope this example shows how easily key feedback loops get overlooked. How many of us claim to have (a) some technical expertise in positive and negative feedback, or (b) an interest in Bayes nets? So why don't we take the time to write out our diagrams? How can we do better?

P.S. There are major oversights in the diagram above, such as economic factors. This is not a limitation of the technique itself -- it is a limitation of the space and effort I've put into it. I have many other such diagrams in the works.
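As a supplement, here is a minimal sketch of how part of the diagram above can be written down as a signed edge list. Node names are abbreviated, and only the unambiguous links are included; the point is that once the edges are explicit, the model becomes easy to diff, extend, or analyze programmatically.

```python
# Partial encoding of the diagram above as signed edges (+1 = reinforcing link).
# The "perceived as unsafe" and "economic factors" loops are omitted here
# because their wiring is only sketched in the diagram.

influences = [
    ("people in general AI",        "effort to make AI powerful",  +1),
    ("effort to make AI powerful",  "AI power",                    +1),
    ("AI power",                    "potential for unsafe AI",     +1),
    ("potential for unsafe AI",     "AI becomes too powerful",     +1),
    ("people in general AI",        "people in AI safety",         +1),  # net movement
    ("people in AI safety",         "effort for safe AI",          +1),
    ("effort for safe AI",          "understanding of AI safety",  +1),
    ("AI power",                    "understanding of AI safety",  +1),  # use AI to reason about safety
    ("understanding of AI safety",  "alignment solved",            +1),
    ("understanding of AI safety",  "people in AI safety",         +1),  # success begets interest
]

# One question that becomes mechanical once the edges are explicit:
# which nodes currently have no modeled inputs?
sources = {s for s, _, _ in influences}
targets = {t for _, t, _ in influences}
print("no modeled inputs:", sources - targets)
```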

To clarify: the claim is that the Shapley value is the only attribution scheme that guarantees all four properties simultaneously: {Efficiency, Symmetry, Linearity, Null player}. Other metrics can satisfy proper subsets of these properties.
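For anyone who wants to check the properties directly, here is a minimal sketch that computes exact Shapley values for a made-up 3-player game by averaging marginal contributions over all orderings. The coalition values are invented for illustration, and brute-force enumeration only scales to small games.

```python
from itertools import permutations

# Toy 3-player cooperative game; the coalition values are invented for illustration.
players = ["A", "B", "C"]
v = {
    frozenset(): 0,
    frozenset("A"): 10, frozenset("B"): 10, frozenset("C"): 0,
    frozenset("AB"): 30, frozenset("AC"): 20, frozenset("BC"): 10,
    frozenset("ABC"): 40,
}

def shapley(players, v):
    """Exact Shapley values: average each player's marginal contribution over all orderings."""
    phi = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for p in order:
            phi[p] += v[coalition | {p}] - v[coalition]   # marginal contribution
            coalition = coalition | {p}
    return {p: total / len(orders) for p, total in phi.items()}

phi = shapley(players, v)
print(phi)

# Efficiency: the attributions sum to the value of the grand coalition.
assert abs(sum(phi.values()) - v[frozenset(players)]) < 1e-9
```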

Hopefully, you have gained some intuition for why Shapley values are “fair” and why they account for interactions among players.

The article fails to make a key point: in political economy and game theory, there are many definitions of "fairness" that seem plausible at face value, especially when considered one at a time. Even if one puts normative questions to the side, there are mathematical limits and constraints as one tries to satisfy various combinations simultaneously. Keeping these in mind, you can think of this as a design problem; it takes some care to choose metrics that reinforce some set of desired norms.

Had the bill been signed, it would have created pressure severe enough to do more with less, i.e. to focus on building better and better abstractions once the limits are hit.

Ok, I see the argument. But even without such legislation, the costs of large training runs create major incentives to build better abstractions.

Does this summary capture the core argument? Physical constraints on the human brain contributed to its success relative to other animals, because it had to "do more with less" by using abstraction. Analogously, constraints on AI compute or size will encourage more abstraction, increasing the likelihood of "foom" danger.

Though I'm reasonably sure Llama license (sic) isn't preventing viewing the source

This is technically correct but irrelevant. Meta doesn't provide any source code, by which I mean the full set of precursor steps (including the data and how to process it).

Generally speaking, a license defines usage rights; it has nothing to do with whether or how the thing (e.g. source code) is made available.

As an odd edge case, one could publish a repository containing a license but no source code. The license would have no power to mandate that the code be released; that is a separate concern.

To put it another way, a license does not obligate the owner to release or share anything, whether it be compiled software, source code, weights, etc. A license simply outlines the conditions under which the thing (e.g. source code), once released, can be used or modified.

Answer by David James

The paper AI Control: Improving Safety Despite Intentional Subversion is a practical, important step in the right direction. It demonstrates various protocols that aim for safety even when the model being used is malicious and knows it is suspected of being dangerous.

Ryan Greenblatt, Buck Shlegeris, Kshitij Sachan, Fabien Roger. Proceedings of the 41st International Conference on Machine Learning, PMLR 235:16295-16336, 2024.

As large language models (LLMs) become more powerful and are deployed more autonomously, it will be increasingly important to prevent them from causing harmful outcomes. To do so, safety measures either aim at making LLMs try to avoid harmful outcomes or aim at preventing LLMs from causing harmful outcomes, even if they try to cause them. In this paper, we focus on this second layer of defense. We develop and evaluate pipelines of safety techniques (protocols) that try to ensure safety despite intentional subversion - an approach we call AI control. We investigate a setting in which we want to solve a sequence of programming problems without ever submitting subtly wrong code, using access to a powerful but untrusted model (in our case, GPT-4), access to a less powerful trusted model (in our case, GPT-3.5), and limited access to high-quality trusted labor. We investigate a range of protocols and red-team them by exploring strategies that the untrusted model could use to subvert them. We find that using the trusted model to edit untrusted-model code or using the untrusted model as a monitor substantially improves on simple baselines.
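Purely to illustrate the control structure described in the abstract (all model and audit calls below are toy stand-ins, not the paper's actual implementation, and the 2% audit budget is an assumption for this example), here is a minimal sketch of trusted monitoring: the untrusted model proposes solutions, the trusted model scores how suspicious each one looks, and the limited trusted labor is spent only on the most suspicious ones.

```python
import random

AUDIT_FRACTION = 0.02  # assumption for this sketch: audit the top ~2% most suspicious outputs

def untrusted_solve(problem):   # stand-in for the powerful, untrusted model
    return f"untrusted solution to {problem}"

def trusted_suspicion(code):    # stand-in for the weaker, trusted monitor
    return random.random()

def trusted_solve(problem):     # stand-in for the trusted model writing its own solution
    return f"trusted solution to {problem}"

def audit(problem, code):       # stand-in for scarce, high-quality trusted labor
    return True                 # pretend the audit found nothing wrong

def solve_with_control(problems):
    drafts = [(p, untrusted_solve(p)) for p in problems]
    scored = sorted(((trusted_suspicion(c), p, c) for p, c in drafts), reverse=True)
    budget = max(1, int(AUDIT_FRACTION * len(problems)))

    results = []
    for rank, (suspicion, p, code) in enumerate(scored):
        if rank < budget and not audit(p, code):
            code = trusted_solve(p)   # fall back to the trusted model if the audit fails
        results.append((p, code))
    return results

print(solve_with_control([f"problem {i}" for i in range(10)])[:2])
```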

Related Video by Robert Miles: I highly recommend Using Dangerous AI, But Safely?, released on Nov. 15, 2024.

From Is Deep Learning Actually Hitting a Wall? Evaluating Ilya Sutskever's Recent Claims:

But what we seem to be seeing is a bit different from deep learning broadly hitting a wall. More specifically it appears to be: returns to scaling up model pretraining are plateauing.
