All of David Reber's Comments + Replies

In reality, we observe that roughly 85% of recommendations stay the same when flipping nationality in the prompt and freezing reasoning traces. This suggests that the mechanism by which the model decides on its recommendation is mostly mediated through the reasoning trace, with a smaller, less significant direct effect from the prompt to the recommendation.

Based on playing around recently with a similar setup (but only toy examples), I'm actually surprised you get only 85%, as I've only observed NDE=0 when I freeze the entire reasoning_trace. 

My just-so ex... (read more)
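
(For concreteness, here is a rough sketch of the prompt-flip / trace-freeze measurement being discussed — `get_recommendation` is a hypothetical placeholder, not anyone's actual code:)

```python
def fraction_unchanged(examples, get_recommendation):
    """Fraction of recommendations that stay the same when the nationality in the
    prompt is flipped while the original reasoning trace is held fixed.

    `get_recommendation(prompt, reasoning_trace)` is a hypothetical helper that
    queries the model for its final recommendation given a prompt and a frozen trace.
    """
    unchanged = 0
    for ex in examples:
        original = get_recommendation(ex["prompt"], ex["reasoning_trace"])
        flipped = get_recommendation(ex["flipped_prompt"], ex["reasoning_trace"])
        unchanged += int(original == flipped)
    return unchanged / len(examples)

# A value near 1.0 corresponds to a natural direct effect (prompt -> recommendation,
# not routed through the trace) near zero; the post reports roughly 0.85.
```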

Ah, that's right. Thanks, that example is quite clarifying!

Also, it appears that the two diagrams in the Frankenstein Rule section differ in their d-separation of (x_1 \indep x_4 | x_5) (which doesn't hold in the left diagram), so these are not actually equivalent (we can't have an underlying distribution satisfy both of these diagrams)

7johnswentworth
Oh, we can totally have an underlying distribution satisfy both of these diagrams. The key is that, while the right diagram asserts (x_1 \indep x_4 | x_5), the left does not say that x_1 can't be independent of x_4 given x_5. Remember the interpretation: an underlying distribution satisfies a DAG if-and-only-if the distribution factors over that DAG. We neither assume nor prove minimality; the DAG does not need to be minimal. So, for instance, a distribution in which all five variables are unconditionally independent would satisfy every diagram over those variables. You are right that the two diagrams are not equivalent (i.e. there exist distributions which satisfy either one but not the other), and we're not claiming they're equivalent. We're just saying "assume that some distribution satisfies both of these two diagrams; what other diagrams must the distribution then satisfy?".
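
(For reference, the factorization criterion being invoked here, in standard Bayes-net notation rather than any wording from the post: a distribution P satisfies a DAG G over x_1, ..., x_n iff

$$P(x_1, \dots, x_n) = \prod_{i=1}^{n} P\big(x_i \mid \mathrm{pa}_G(x_i)\big),$$

so, for example, the fully independent distribution P(x_1, ..., x_5) = \prod_i P(x_i) factors over every DAG on those five variables, which is the "satisfies every diagram" case mentioned above.)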

The theorems in this post all say something like "if the distribution (approximately) factors according to <some DAGs>, then it also (approximately) factors according to <some other DAGs>"

So one motivating research question might be phrased as "Probability distributions have an equivalence class of Bayes nets / causal diagrams which are all compatible. But what is the structure within a given equivalence class? In particular, if we have a representative Bayes net of an equivalence class, how might we algorithmically generate other Bayes nets in that equivalence class?"

Could you clarify how this relates to e.g. the PC (Peter-Clark) or FCI (Fast Causal Inference) algorithms for causal structure learning? 

Like, are you making different assumptions (than e.g. minimality, faithfulness, etc)?

5johnswentworth
First, there's a different type signature. PC or FCI-style causal structure learning typically takes in a bunch of conditional independence tests (or a function for performing conditional independence tests, or a pile of data on which to perform conditional independence tests, etc) and spits out a DAG. For this post, it's DAGs in, DAGs out. Second, a different use-case. The theorems in this post all say something like "if the distribution (approximately) factors according to <some DAGs>, then it also (approximately) factors according to <some other DAGs>". This use-case doesn't directly assume or prove anything about minimality; none of the theorems say that the distribution can't also satisfy some DAG with fewer edges. Likewise, we neither assume nor prove anything about faithfulness. (Though one could perhaps use these theorems to prove minimality and/or faithfulness in particular cases, in which case the preconditions for the proof would be that the distribution does/doesn't satisfy some DAGs.) If we wanted to use these theorems for causal structure learning, at a high level the method would look something like: * First, find some DAGs the distribution satisfies via some other method(s). * Then, use the theorems here to derive a stronger DAG which must be satisfied. For instance, we could imagine that two groups of scientists do a bunch of experiments on the same system. One group finds that the system is well-modelled by the left DAG at the top of the Frankenstein Rule section of this post, the other group finds that the system is well-modelled by the right DAG. Then, we could try to Frankenstein those two DAGs together in a way which keeps as few edges as possible.
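
(A toy sketch of the "DAGs in, DAGs out" use-case described above — the rule statement here is my loose paraphrase and the example DAGs are made up, so check the post for the exact preconditions:)

```python
def frankenstein(parents_a, parents_b, take_from_a, order):
    """Mix two DAGs (given as {node: parent_set}) node-by-node: for each node, keep
    its parent set from DAG A or DAG B. As I understand the Frankenstein Rule, this
    requires a topological order compatible with both input DAGs, checked below."""
    combined = {
        node: (parents_a[node] if take_from_a[node] else parents_b[node])
        for node in order
    }
    position = {node: i for i, node in enumerate(order)}
    for node, parents in combined.items():
        assert all(position[p] < position[node] for p in parents), \
            "combined parent sets must respect the shared topological order"
    return combined

# Hypothetical example: two groups' DAGs over x1..x5, combined to keep fewer edges.
dag_a = {"x1": set(), "x2": {"x1"}, "x3": {"x2"}, "x4": {"x3"}, "x5": {"x4"}}
dag_b = {"x1": set(), "x2": set(), "x3": {"x1"}, "x4": {"x1"}, "x5": {"x3", "x4"}}
choice = {"x1": True, "x2": False, "x3": False, "x4": True, "x5": False}
print(frankenstein(dag_a, dag_b, choice, ["x1", "x2", "x3", "x4", "x5"]))
```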

So the contributions of VNM theory are shrunk down into "intention"?

(Background: I consider myself fairly well-read w.r.t. causal incentives, not very familiar with VNM theory, and well-versed in Pearlian causality. I have gotten a sneak peek at this sequence so have a good sense of what's coming)

I'm not sure I understand VNM theory, but I would suspect the relationship is more like "VNM theory and <this agenda> are two takes on how to reason about the behavior of agents, and they both refer to utilities and Bayesian networks, but have important di... (read more)

Distinguish two types of shutdown goals: temporary and permanent. These types of goals may differ with respect to entrenchment. AGIs that seek temporary shutdown may be incentivized to protect themselves during their temporary shutdown. Before shutting down, the AGI might set up cyber defenses that prevent humans from permanently disabling it while ‘asleep’. This is especially pressing if the AGI has a secondary goal, like paperclip manufacturing. In that case, protection from permanent disablement increases its expected goal satisfaction. On the other han

... (read more)

Another related work: Concept Algebra for Text-Controlled Vision Models (Disclosure: while I did not author this paper, I am in the PhD lab that did, under Victor Veitch at UChicago. Any mistakes made in this comment are my own). We haven't prioritized a blog post about the paper, so it makes sense that this community isn't familiar with it.

The concept algebra paper demonstrates that for text-to-image models like Stable Diffusion, there exist linear subspaces in the score embedding space, on which you can do the same manner of concept editing/control as... (read more)
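
(To gesture at what this kind of linear editing looks like in the simplest case — this is a generic projection-based edit, not the paper's actual algorithm, and the names here are my own:)

```python
import numpy as np

def linear_concept_edit(embedding, remove_dir, add_dir, strength=1.0):
    """Generic linear concept edit: project the embedding off one concept direction,
    then add another. Illustrative only; the concept algebra paper works with
    subspaces of the score embedding and is more careful than this."""
    remove_dir = remove_dir / np.linalg.norm(remove_dir)
    add_dir = add_dir / np.linalg.norm(add_dir)
    stripped = embedding - (embedding @ remove_dir) * remove_dir
    return stripped + strength * add_dir
```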

3Bogdan Ionut Cirstea
Seems very related: Linear Spaces of Meanings: Compositional Structures in Vision-Language Models. Notably, the (approximate) compositionality of language/reality should bode well for the scalability of linear activation engineering methods.

Also, just to make sure we share a common understanding of Schölkopf 2021: Wouldn't you agree that asking "how do we do causality when we don't even know at what level of abstraction to define causal variables?" is beyond the "usual Pearl causality story" as usually summarized in FFS posts? It certainly goes beyond Pearl's well-known works.

I don't think my claim is that "FFS is already subsumed by work in academia": as I acknowledge, FFS is a different theoretical framework than Pearl-based causality. I view them as two distinct approaches, but my claim is that they are motivated by the same question (that is, how to do causal representation learning). 

It was intentional that the linked paper is an intro survey paper to the Pearl-ish approach to causal rep. learning: I mean to indicate that there are already lots of academic researchers studying the question "what does it mean to ... (read more)

2Alexander Gietelink Oldenziel
Scott Garrabrant conceived of FFS as an extension & generalization of Pearlian causality that answers questions that are not dealt with well in the Pearlian framework. He is aware of Pearl's work and explicitly builds on it. It's not a distinct approach as much as an extension. The paper you mentioned discusses the problem of figuring out what the right variables are but poses no solution (as far as I can tell). That shouldn't be surprising, because the problem is very hard. Many people have thought about it but there is only one Garrabrant. I do agree with your overall perspective that people in alignment are quite insular, unaware of the literature and often reinventing the wheel.

Tho as a counterpoint, maybe Auto-GPT presents some opportunities to empirically test the IDA proposal? To have a decent experiment, you would need a good metric for alignment (does that exist?) and demonstrate that as you implement IDA using Auto-GPT, your metric is at least maintained, even as capabilities improve on the newer models. 

I'm overall skeptical of my particular proposal however, because 1. I'm not aware of any well-rounded "alignment" metrics, and 2. you'd need to be confident that you can scale it up without losing control (because if t... (read more)

4Michaël Trazzi
The evidence I'm interested in goes something like: * we have more empirical ways to test IDA * it seems like future systems will decompose / delegate tasks to some sub-agents, so if we think either 1) it will be an important part of the final model that successfully recursively self-improves or 2) there are non-trivial chances that this leads us to AGI before we can try other things, maybe it's high EV to focus more on IDA-like approaches?

To clarify, here I'm not taking a stance on whether IDA should be central to alignment or not, simply claiming that unless you have a crux of "whether or not recursive improvement is easy to do" as the limiting factor for IDA being a good alignment strategy, your assessment of IDA should probably stay largely unchanged.

My understanding of Auto-GPT is that it strings together many GPT-4 requests, while notably also giving it access to memory and the internet. Empirically, this allocation of resources and looping seems promising for solving complex tasks, such as debugging the code of Auto-GPT itself. (For those interested, this paper discusses how looped transformers can serve as general-purpose computers).

But to my ears, that just sounds like an update of the form “GPT can do many tasks well”, not in the form of “Aligned oversight is tractable”. Put anot... (read more)

3David Reber
Tho as a counterpoint, maybe Auto-GPT presents some opportunities to empirically test the IDA proposal? To have a decent experiment, you would need a good metric for alignment (does that exist?) and demonstrate that as you implement IDA using Auto-GPT, your metric is at least maintained, even as capabilities improve on the newer models.  I'm overall skeptical of my particular proposal however, because 1. I'm not aware of any well-rounded "alignment" metrics, and 2. you'd need to be confident that you can scale it up without losing control (because if the experiment fails, then by definition you've developed a more powerful AI which is less aligned). But it's plausible to me that someone could find some good use for Auto-GPT for alignment research, now that it has been developed. It's just not clear to me how you would do so in a net-positive way.
3David Reber
To clarify, here I'm not taking a stance on whether IDA should be central to alignment or not, simply claiming that unless you have a crux of "whether or not recursive improvement is easy to do" as the limiting factor for IDA being a good alignment strategy, your assessment of IDA should probably stay largely unchanged.

Strongly upvoting this for being a thorough and carefully cited explanation of how the safety/alignment community doesn't engage enough with relevant literature from the broader field, likely at the cost of duplicated work, suboptimal research directions, and less exchange and diffusion of important safety-relevant ideas.

Ditto. I've recently started moving into interpretability / explainability and spent the past week skimming the broader literature on XAI, so the timing of this carefully cited post is quite impactful for me.

I see similar things happening... (read more)

2Alexander Gietelink Oldenziel
I was intrigued by your claim that FFS is already subsumed by work in academia. I clicked the link you provided but from a quick skim it doesn't seem to do FFS or anything beyond the usual Pearl causality story as far as I can tell. Maybe I am missing something - could you provide a specific page where you think FFS is being subsumed?

A few thoughts:

  • This seems like a good angle for how to bridge AI safety and a number of disciplines
  • I appreciated the effort to cite peer-reviewed sources and provide search terms that can be looked into further
  • While I'm still parsing the full validity/relevance of the concrete agendas suggested, they do seem to fit the form of "what relevance is there from established fields?" without diluting the original AI safety motivations too much
  • Overall, it's quite long, and I would very much like to see a distilled version (say, 1/5 the length). 
    • (but that's just a moderate signal from someone who was already interested, yet still nearly bounced off)

Under the "reward as selection" framing, I find the behaviour much less confusing:

  • We use reward to select for actions that led to the agent reaching the coin.
  • This selects for models implementing the algorithm "move towards the coin".
  • However, it also selects for models implementing the algorithm "always move to the right".
  • It should therefore not be surprising you can end up with an agent that always moves to the right and not necessarily towards the coin.
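
(A toy illustration of that selection ambiguity — my own construction, not from the post: two policies that are indistinguishable by reward when the coin is always on the far right, but come apart once the coin moves:)

```python
def go_right(pos, coin, width):
    return min(pos + 1, width - 1)

def go_to_coin(pos, coin, width):
    return pos + (1 if coin > pos else -1 if coin < pos else 0)

def reward(policy, coin, width=10, steps=20):
    pos = 0
    for _ in range(steps):
        pos = policy(pos, coin, width)
    return 1.0 if pos == coin else 0.0  # reward for ending the episode on the coin

# Training distribution: coin always at the rightmost cell, so selecting on reward
# cannot distinguish the two algorithms.
print(reward(go_right, coin=9), reward(go_to_coin, coin=9))  # 1.0 1.0

# Shifted distribution: the coin moves, and the selected-for behaviours diverge.
print(reward(go_right, coin=3), reward(go_to_coin, coin=3))  # 0.0 1.0
```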

 

I've been reconsidering the coin run example as well recently from a causal perspective, and your ... (read more)

I'm also working on extending the framework to the infinite setting and am almost finished except for conditional orthogonality for uncountable sets.

 

Hmm, what would be the intuition/application behind the uncountable setting? Like, when would one want that (I don't mind if it's niche, I'm just struggling to come up with anything)?

2Matthias G. Mayer
A direct application would require that you have an uncountable variable. You might want to do this if you have enough evidence to say this confidently. As a simple example, imagine a real-valued graph where all your data points lie almost on the identity diagonal. You might then want to infer a variable which is the identity. As a more general application, we want to model infinities because the world is probably infinite in some aspects. We then want a theorem that tells us that, even if the underlying model is infinite, if you have enough data points then you are close enough, as with the strong law of large numbers, for example.

I'd be interested in seeing other matrix factorizations explored as well. Specifically, I would recommend trying nonnegative matrix factorizations: to quote the Wikipedia article:

This non-negativity makes the resulting matrices easier to inspect. Also, in applications such as processing of audio spectrograms or muscular activity, non-negativity is inherent to the data being considered.

The added constraint may help eliminate spurious patterns: for instance, I suspect the positive/negative singular value distinction might be a red herring (based on past projects I've worked on).
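
(A minimal sketch of the comparison I have in mind, using scikit-learn on toy data — illustrative only; the actual matrices in question would need to be made nonnegative, e.g. by shifting or clipping, before NMF applies:)

```python
import numpy as np
from sklearn.decomposition import NMF, TruncatedSVD

rng = np.random.default_rng(0)
X = np.abs(rng.normal(size=(100, 50)))  # toy nonnegative matrix standing in for the real one

# SVD-style factorization: components may mix positive and negative entries.
svd = TruncatedSVD(n_components=10, random_state=0)
svd_scores = svd.fit_transform(X)

# Nonnegative factorization: X ~ W @ H with W, H >= 0, which tends to yield
# additive, parts-like factors that are easier to inspect.
nmf = NMF(n_components=10, init="nndsvd", random_state=0, max_iter=500)
W = nmf.fit_transform(X)
H = nmf.components_

print(svd.explained_variance_ratio_.sum(), nmf.reconstruction_err_)
```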

I second this, that it's difficult to summarize AI-safety-relevant academic work for LW audiences. I want to highlight the symmetric difficulty of trying to summarize the mountain of blog-post-style work on the AF for academics.

In short, both groups have steep reading/learning curves that are under-appreciated when you're already familiar with it all.

Anecdotally, I've found the same said of Less Wrong / Alignment Forum posts among AI safety / EA academics: that it amounts to an echo chamber that no one else reads.

I suspect both communities are taking their collective lack of familiarity with the other as evidence that the other community isn't doing their part to disseminate their ideas properly. Of course, neither community seems particularly interested in taking the time to read up on the other, and seems to think that the other community should simply mimic their example (LWers want more LW synopses... (read more)

The causal incentives working group should get mentioned; it's directly on AI safety. Though it's a bit older, I gained a lot of clarity about AI safety concepts via "Modeling AGI Safety Frameworks with Causal Influence Diagrams", which is quite accessible even if you don't have a ton of training in causality.

[Warning: "cyclic" overload. I think in this post it's referring to the dynamical systems definition, i.e. variables reattain the same state later in time. I'm referring to Pearl's causality definition: variable X is functionally dependent on variable Y, which is itself functionally dependent on variable X.]

Turns out Chaos is not Linear...

I think the bigger point (which is unaddressed here) is that chaos can't arise in acyclic causal models (SCMs). Chaos can only arise when there is feedback between the variables, right? Hence the characterization of chaos... (read more)

2johnswentworth
Yeah, the chaos piece predated the Telephone Theorem. The Telephone Theorem does apply just fine to chaotic systems (the Bayes Net just happens to have time symmetry), but it's way more general.

As I understand it, the proof in the appendix only assumes we're working with Bayes nets (so just factorizations of probability distributions). That is, no assumption is made that the graphs are causal in nature (they're not necessarily assumed to be the causal diagrams of SCMs) although of course the arguments still port over if we make that stronger assumption.

Is that correct?

2johnswentworth
Yup.