1. Introduction: Clarifying the DSA-AI theses

JustinShovelain, Mateusz Bagiński

11mo

(Work done at Convergence Analysis. The ideas are due to Justin. Mateusz wrote the post. Thanks to Olga Babeeva for feedback on this post.)

In this post, we introduce the typology of structure, function, and randomness that builds on the framework introduced in the post Goodhart's Law Causal Diagrams. We aim to present a comprehensive categorization of the causes of Goodhart's problems.

But first, why do we care about this?

Goodhart's Law recap

The standard definition of Goodhart's Law is: "when a proxy for some value becomes the target of optimization pressure, the proxy will cease to be a good proxy."

More specifically: we see a meaningful statistical relationship between the values of two random variables $I$ and $T$,^[1] where we have some... (read 4483 more words →)

Bounded AI might be viable

Mateusz Bagiński, JustinShovelain

(Work done at Convergence Analysis. Mateusz wrote the post and is responsible for most of the ideas with Justin helping to think it through. Thanks to Olga Babeeva for the feedback on this post.)

1. Motivation

Suppose the prospects of pausing or significantly slowing down AI progress, and of solving the technical problems necessary to ensure that arbitrarily strong AI has good effects on humanity (in time, before we get such systems), both look gloomy.^[1] What options do we have left?

Adam Shimi presents a useful frame on the alignment problem in Abstracting The Hardness of Alignment: Unbounded Atomic Optimization:

alignment [is] the problem of dealing with impact on the world (optimization) that is both of unknown magnitude (unbounded) and non-interruptible (atomic).

If the problem is about some... (read 5941 more words →)

JustinShovelain 1y

I agree.

Anthropic's marginal contribution to safety (compared to what we would have in a world without Anthropic) probably doesn't offset Anthropic's contribution to the AI race.

I think there are more worlds where Anthropic is contributing to the race in a negative fashion than there are worlds where Anthropic's marginal safety improvement over OpenAI/DeepMind-ish orgs is critical for securing a good future with AGI (weighing things according to the impact sizes and probabilities).

JustinShovelain 1y

More generally you can use the following typology to inspire creating more interventions.

Intervention points for changing or forming an AGI company and its surroundings so as to reduce x-risk (I've used this in advising startups on AI safety; it is also related to my post on positions where people can be in the loop):

Type of organization: nonprofit, public benefit organization, have a partner non-profit, join the government
Rules of organization, event triggers:
- Rules:
  - x-risk mission statement
  - x-risk strategic plan
- Triggering events:
  - Gets very big: windfall clause
  - Gets sold to another party: ethics board, restrictions on potential sale
  - Value drift: reboot board and CEOs, shut it down, allocate more resources to safety, build a new company, put the ethics board in charge, build a

... (read 422 more words →)

JustinShovelain 1y

Thanks for asking the question!

Some things I'd especially like to see change (in as much as I know what is happening) are:

Making more use of available options to improve AI safety (I think there are more options than Anthropic seems to believe. For instance, 30% of funds could be allocated to AI safety research if framed well, and it would probably be below the noise threshold/froth of VC investing. Also, there is probably a fair degree of freedom in socially promoting concern around unaligned AGI.)
Explicit ways to handle various types of events like organizational value drift, hostile government takeover, organization gets sold or unaligned investors have control, another AGI company

Boxing an agent more intelligent than ourselves is daunting, but information theory, thermodynamics, and control theory provide us with tools that can fundamentally constrain agents independent of their intelligence. In particular, we may be able to contain an AI by limiting its access to information

Constraining output and inputs

Superintelligent AI has the potential to both help and harm us. One way to minimize the harm is boxing the AI: containing it so that it can’t freely influence its external world, ideally while preserving its potential to help.

Boxing is complicated. Destroying the AI will certainly prevent it from influencing anything. Throwing it into a black hole might causally disconnect it from the rest of the... (read 2012 more words →)

Replying to Some background for reasoning about dual-use alignment research

Some background for reasoning about dual-use alignment research

Gotcha. What determines the "ratios" is some sort of underlying causal structure of which some aspects can be summarized by a tech tree. For thinking about the causal structure you may also like this post: https://forum.effectivealtruism.org/posts/TfRexamDYBqSwg7er/causal-diagrams-of-the-paths-to-existential-catastrophe

The risk-reward tradeoff of interpretability research

Interpretability research is conducted to improve our understanding of AI. Many see interpretability as essential for AI safety, but recently some have argued that it can also increase the risk posed by AI by facilitating improved AI capabilities. We agree, and in this post, we’ll explain why, as well as how risks can be reduced.

What is interpretability research?

The more complex a model is, the harder it is for humans to understand. It’s easy to get why a tic-tac-toe AI makes a move, but understanding why AlphaGo makes particular moves in Go is beyond our comprehension. We can describe properties of its algorithm and training - convolutional neural nets something something Monte Carlo tree searches -... (read 1669 more words →)

Replying to Some background for reasoning about dual-use alignment research

Some background for reasoning about dual-use alignment research

Complementary ideas to this article:

https://www.lesswrong.com/posts/BfKQGYJBwdHfik4Kd/fai-research-constraints-and-agi-side-effects (the origin of the fuel tank metaphor Raemon refers to in these comments)
Extending things further to handle higher order derivatives and putting things within a cohesive space: https://forum.effectivealtruism.org/posts/TCxik4KvTgGzMowP9/state-space-of-x-risk-trajectories
A typology for mapping downside risks: https://www.lesswrong.com/posts/RY9XYoqPeMc8W8zbH/mapping-downside-risks-and-information-hazards
A set of potential responses for what to do with potentially dangerous developments and a heuristic for triggering that evaluation: https://www.lesswrong.com/posts/6ur8vDX6ApAXrRN3t/information-hazards-why-you-should-care-and-what-you-can-do
A general heuristic for what technology to develop and how to distribute it: https://forum.effectivealtruism.org/posts/4oGYbvcy2SRHTWgWk/improving-the-future-by-influencing-actors-benevolence
A coherence-focused framework that is more fundamental than the link just above and from which that link's heuristic can be derived: https://www.lesswrong.com/posts/AtwPwD6PBsqfpCsHE/aligning-ai-by-optimizing-for-wisdom

Aligning AI by optimizing for "wisdom"

In this post, we’ll introduce wisdom as a measure of the benevolence and internal coherence of an arbitrary agent. We’ll define several factors, such as the agent’s values, plans, evidence, and alignment with human values, and then define wisdom as consistency within and between these factors. We believe this is a useful optimization target for aligning AI.

Considering the wisdom of intelligent agents

To define our property, we’ll introduce certain coherence factors, such as the agent's beliefs, plans, actions, evidence, and so on, and for each factor determine:

Is the agent self-consistent with regard to this factor? For example, do the agent's actions undermine or support each other?
Is the agent consistent between factors? For example, do the agent's actions match

... (read 3301 more words →)

Improving the safety of AI evals

Many organizations are developing and using AI evaluations, “evals”, to assess the capability, alignment, and safety of AI. However, evals are not entirely innocuous, and we believe the risks they pose are neglected. In this article, we’ll outline some of the risks posed by doing AI evals, and suggest a strategy to improve their safety.

What are AI evals?

Different organizations are trying to improve the safety of AI in many different ways. One approach is evals: evaluations of AI systems that assess factors such as capability, safety, and alignment. Several prominent AI organizations are developing evals:

The Alignment Research Center’s Evals project, which is limited to evaluating capabilities at the moment, but may turn to evaluating alignment in

... (read 1929 more words →)

Keep humans in the loop

In this post, we’ll explore how systems that are populated by people have some protection against emergent coordination failures in society. We’ll argue that if people are removed from the system or disempowered, technological development will drift away from improving human welfare, and coordination failures will worsen. We’ll then focus on AI development, exploring some practical techniques for AI development that keeps humans in the loop.

Moloch: why do coordination failures happen?

In Scott Alexander’s classic Meditations on Moloch, he uses the ancient deity Moloch to personify the perverse incentives that can lead a society to become miserable in spite of individuals’ desire to be non-miserable. This metaphor draws on a poem by Allen Ginsberg,... (read 2712 more words →)

Replying to Dual-Useness is a Ratio

Dual-Useness is a Ratio

Relatedly, here is a post going beyond the framework of a ratio of progress to the effect on the ratio of research that still needs to be done for various outcomes: https://www.lesswrong.com/posts/BfKQGYJBwdHfik4Kd/fai-research-constraints-and-agi-side-effects

Extending further one can examine higher order derivatives and curvature in a space of existential risk trajectories: https://forum.effectivealtruism.org/posts/TCxik4KvTgGzMowP9/state-space-of-x-risk-trajectories

Replying to When you plan according to your AI timelines, should you put more weight on the median future, or the median future | eventual AI alignment success? ⚖️

When you plan according to your AI timelines, should you put more weight on the median future, or the median future | eventual AI alignment success? ⚖️

Roughly speaking, in terms of the actions you take, various timelines should be weighted as P(AGI in year t)*DifferenceYouCanProduceInAGIAlignmentAt(t). This produces a new, unnormalized distribution of how much to prioritize each time (you can renormalize it if you wish to make it more like a "probability").
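As a rough illustration of that weighting, here is a minimal Python sketch. The function name `prioritization_weights` and all of the numbers are made up for the example; they are not estimates of anything, and the sketch treats each year independently, which is itself a simplification.

```python
def prioritization_weights(p_agi, difference):
    """Weight each year t by P(AGI in year t) * difference you can make at t,
    then renormalize so the weights sum to 1."""
    raw = {year: p_agi[year] * difference[year] for year in p_agi}
    total = sum(raw.values())
    if total == 0:
        raise ValueError("all weights are zero; nothing to prioritize")
    return {year: weight / total for year, weight in raw.items()}


# Hypothetical inputs, chosen only to illustrate the calculation.
p_agi = {2030: 0.2, 2040: 0.5, 2050: 0.3}        # P(AGI in year t)
difference = {2030: 0.1, 2040: 0.4, 2050: 0.8}   # difference you can produce at t

weights = prioritization_weights(p_agi, difference)
for year in sorted(weights):
    print(year, round(weights[year], 3))
```

With these made-up inputs the most probable year (2040) is not the one that gets the most weight: the later year wins because the assumed ability to make a difference there is larger.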

Note that this is just a first approximation and there are additional subtleties.

This assumes you are optimizing for each time and possible world orthogonally, but much of the time optimizing for nearby times is very similar to optimizing for a particular time.
The definition of "you" here depends on the nature of the decision maker which can vary between a group, a person, or even a person at

... (read more)

Updating Utility Functions

JustinShovelain, Joar Skalse

This post will be about AIs that “refine” their utility function over time, and how it might be possible to construct such systems without giving them undesirable properties. The discussion relates to corrigibility, value learning, and (to a lesser extent) wireheading.

We (Joar Skalse and Justin Shovelain) have spent some time discussing this topic, and we have gained a few new insights we wish to share. The aim of this post is to be a brief but explanatory summary of those insights. We will provide some motivating intuitions, a problem statement, and a possible partial solution to the problem given in the problem statement. We do not have a complete technical solution to the problem, but... (read 2106 more words →)

Replying to Goodhart's Law Causal Diagrams

JustinShovelain 4y*

Goodhart's Law Causal Diagrams

I think causal diagrams naturally emerge when thinking about Goodhart's law and its implications.

I came up with the concept of Goodhart's law causal graphs above because of a presentation someone gave at the EA Hotel in late 2019 on Scott's Goodhart Taxonomy. I thought causal diagrams were a clearer way to describe some parts of the taxonomy, but their relationship to the taxonomy is complex. I also just encountered the paper you and Scott wrote a couple of weeks ago when getting ready to write this Good Heart Week-prompted post, and I was planning in the next post to reference it when we address "causal stomping" and "function generalization error" and can... (read more)

Goodhart's Law Causal Diagrams