The theorems in this post all say something like "if the distribution (approximately) factors according to <some DAGs>, then it also (approximately) factors according to <some other DAGs>"
So one motivating research question might be phrased as "Probability distributions have an equivalence class of Bayes nets / causal diagrams which are all compatible. But what is the structure within a given equivalence class? In particular, if we have a representative Bayes net of an equivalence class, how might we algorithmically generate other Bayes nets in that equivalence class?"
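To make the question concrete, here is a toy illustration (the numbers and variable names are mine, not from the post): the chains X → Y → Z and Z → Y → X are Markov equivalent, so any distribution that factors according to one also factors according to the other. A minimal sketch in Python:

```python
from itertools import product

# Hypothetical toy distribution over binary X, Y, Z, constructed to
# factor according to the chain X -> Y -> Z.
p_x = {0: 0.3, 1: 0.7}
p_y_given_x = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
p_z_given_y = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.1, 1: 0.9}}

joint = {(x, y, z): p_x[x] * p_y_given_x[x][y] * p_z_given_y[y][z]
         for x, y, z in product((0, 1), repeat=3)}

def marg(keep):
    """Marginal over the variable indices in `keep` (0=X, 1=Y, 2=Z)."""
    out = {}
    for key, p in joint.items():
        sub = tuple(key[i] for i in keep)
        out[sub] = out.get(sub, 0.0) + p
    return out

p_y, p_z = marg([1]), marg([2])
p_xy, p_yz = marg([0, 1]), marg([1, 2])

# The same joint also factors according to the reversed chain
# Z -> Y -> X: p(x,y,z) = p(z) * p(y|z) * p(x|y).
for (x, y, z), p in joint.items():
    reversed_chain = p_z[(z,)] * (p_yz[(y, z)] / p_z[(z,)]) \
                     * (p_xy[(x, y)] / p_y[(y,)])
    assert abs(p - reversed_chain) < 1e-12
```

The open question in the comment is how to enumerate such alternative factorizations algorithmically, rather than checking candidates one at a time as above.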
So the contributions of VNM theory are shrunk down into "intention"?
(Background: I consider myself fairly well-read w.r.t. causal incentives, not very familiar with VNM theory, and well-versed in Pearlian causality. I have gotten a sneak peek at this sequence so have a good sense of what's coming)
I'm not sure I understand VNM theory, but I would suspect the relationship is more like "VNM theory and <this agenda> are two takes on how to reason about the behavior of agents, and they both refer to utilities and Bayesian networks, but have important di...
...Distinguish two types of shutdown goals: temporary and permanent. These types of goals may differ with respect to entrenchment. AGIs that seek temporary shutdown may be incentivized to protect themselves during their temporary shutdown. Before shutting down, the AGI might set up cyber defenses that prevent humans from permanently disabling it while ‘asleep’. This is especially pressing if the AGI has a secondary goal, like paperclip manufacturing. In that case, protection from permanent disablement increases its expected goal satisfaction. On the other han
Another related work: Concept Algebra for Text-Controlled Vision Models (Disclosure: while I did not author this paper, I am in the PhD lab that did, under Victor Veitch at UChicago. Any mistakes in this comment are my own). We haven't prioritized a blog post about the paper, so it makes sense that this community isn't familiar with it.
The concept algebra paper demonstrates that for text-to-image models like Stable Diffusion, there exist linear subspaces in the score embedding space, on which you can do the same manner of concept editing/control as...
Also, just to make sure we share a common understanding of Schölkopf 2021: Wouldn't you agree that asking "how do we do causality when we don't even know at what level of abstraction to define causal variables?" is beyond the "usual Pearl causality story" as usually summarized in FFS posts? It certainly goes beyond Pearl's well-known works.
I don't think my claim is that "FFS is already subsumed by work in academia": as I acknowledge, FFS is a different theoretical framework than Pearl-based causality. I view them as two distinct approaches, but my claim is that they are motivated by the same question (that is, how to do causal representation learning).
It was intentional that the linked paper is an intro survey paper to the Pearl-ish approach to causal rep. learning: I mean to indicate that there are already lots of academic researchers studying the question "what does it mean to ...
Though as a counterpoint, maybe Auto-GPT presents some opportunities to empirically test the IDA proposal? To have a decent experiment, you would need a good metric for alignment (does that exist?) and to demonstrate that as you implement IDA using Auto-GPT, your metric is at least maintained, even as capabilities improve on the newer models.
I'm overall skeptical of my particular proposal however, because 1. I'm not aware of any well-rounded "alignment" metrics, and 2. you'd need to be confident that you can scale it up without losing control (because if t...
To clarify, here I'm not taking a stance on whether IDA should be central to alignment or not, simply claiming that unless you have a crux of "whether or not recursive improvement is easy to do" as the limiting factor for IDA being a good alignment strategy, your assessment of IDA should probably stay largely unchanged.
My understanding of Auto-GPT is that it strings together many GPT-4 requests, while notably also giving the model access to memory and the internet. Empirically, this allocation of resources and looping seems promising for solving complex tasks, such as debugging the code of Auto-GPT itself. (For those interested, this paper discusses how looped transformers can serve as general-purpose computers).
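The looping pattern I have in mind can be sketched as follows. This is hypothetical pseudostructure, not Auto-GPT's actual implementation; `call_model` is a stub standing in for a GPT-4 API call:

```python
# Minimal sketch of an Auto-GPT-style loop (all names hypothetical).

def call_model(prompt: str) -> str:
    # Stub: a real agent would query an LLM here.
    if "step 3" in prompt:
        return "DONE: task complete"
    return f"step {prompt.count('step') + 1}"

def run_agent(goal: str, max_iters: int = 10) -> list[str]:
    memory: list[str] = []           # persistent scratchpad between calls
    for _ in range(max_iters):
        prompt = goal + "\n" + "\n".join(memory)
        action = call_model(prompt)  # one model request per loop iteration
        memory.append(action)
        if action.startswith("DONE"):
            break
    return memory

trace = run_agent("debug my code")
```

The point is just that the "agent" is the outer loop plus memory; each individual model call is stateless.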
But to my ears, that just sounds like an update of the form “GPT can do many tasks well”, not in the form of “Aligned oversight is tractable”. Put anot...
Strongly upvoting this for being a thorough and carefully cited explanation of how the safety/alignment community doesn't engage enough with relevant literature from the broader field, likely at the cost of duplicated work, suboptimal research directions, and less exchange and diffusion of important safety-relevant ideas.
Ditto. I've recently started moving into interpretability / explainability and spent the past week skimming the broader literature on XAI, so the timing of this carefully cited post is quite impactful for me.
I see similar things happening...
A few thoughts:
Under the "reward as selection" framing, I find the behaviour much less confusing:
- We use reward to select for actions that led to the agent reaching the coin.
- This selects for models implementing the algorithm "move towards the coin".
- However, it also selects for models implementing the algorithm "always move to the right".
- It should therefore not be surprising that you can end up with an agent that always moves to the right and not necessarily towards the coin.
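The selection story above can be made concrete with a toy sketch (the levels, policies, and reward function here are all hypothetical stand-ins for the CoinRun setup):

```python
# A level is (coin_position, right_edge); a policy maps a level to the
# position where the agent ends up. Reward 1 iff the agent reaches the coin.

train_levels = [(9, 9), (9, 9), (9, 9)]   # coin always at the right edge
test_level = (4, 9)                        # coin moved away from the edge

def go_to_coin(level):
    coin, _ = level
    return coin

def go_right(level):
    _, edge = level
    return edge

def total_reward(policy, levels):
    return sum(1 for lvl in levels if policy(lvl) == lvl[0])

# Both policies are maximally rewarded on the training distribution...
assert total_reward(go_to_coin, train_levels) == 3
assert total_reward(go_right, train_levels) == 3
# ...so selection on training reward cannot distinguish them,
# yet they diverge on the test level.
assert go_to_coin(test_level) == 4 and go_right(test_level) == 9
```

Selection on training reward is indifferent between the two algorithms; which one you get is decided by inductive bias, not by the reward signal.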
I've been reconsidering the coin run example as well recently from a causal perspective, and your ...
I'm also working on extending the framework to the infinite setting and am almost finished except for conditional orthogonality for uncountable sets.
Hmm, what would be the intuition/application behind the uncountable setting? Like, when would one want that (I don't mind if it's niche, I'm just struggling to come up with anything)?
I'd be interested in seeing other matrix factorizations explored as well. Specifically, I would recommend trying nonnegative matrix factorization. To quote the Wikipedia article:
> This non-negativity makes the resulting matrices easier to inspect. Also, in applications such as processing of audio spectrograms or muscular activity, non-negativity is inherent to the data being considered.
The added constraint may help eliminate spurious patterns: for instance, I suspect the positive/negative singular value distinction might be a red herring (based on past projects I've worked on).
I second this, that it's difficult to summarize AI-safety-relevant academic work for LW audiences. I want to highlight the symmetric difficulty of trying to summarize the mountain of blog-post-style work on the AF for academics.
In short, both groups have steep reading/learning curves that are under-appreciated when you're already familiar with it all.
Anecdotally, I've found the same said of Less Wrong / Alignment Forum posts among AI safety / EA academics: that it amounts to an echo chamber that no one else reads.
I suspect both communities are taking their collective lack of familiarity with the other as evidence that the other community isn't doing their part to disseminate their ideas properly. Of course, neither community seems particularly interested in taking the time to read up on the other, and seems to think that the other community should simply mimic their example (LWers want more LW synopses...
The causal incentives working group should get a mention; its work is directly on AI safety. Though it's a bit older, I gained a lot of clarity about AI safety concepts via "Modeling AGI Safety Frameworks with Causal Influence Diagrams", which is quite accessible even if you don't have a ton of training in causality.
[Warning: "cyclic" overload. I think in this post it's referring to the dynamical systems definition, i.e. variables reattain the same state later in time. I'm referring to Pearl's causality definition: variable X is functionally dependent on variable Y, which is itself functionally dependent on variable X.]
Turns out Chaos is not Linear...
I think the bigger point (which is unaddressed here) is that chaos can't arise for acyclic causal models (SCMs). Chaos can only arise when there is feedback between the variables right? Hence the characterization of chaos...
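As a standard illustration of that point (my example, not the post's): the logistic map, in which a single variable feeds back into itself, exhibits the sensitive dependence on initial conditions characteristic of chaos:

```python
# Sensitivity to initial conditions in the logistic map x' = r*x*(1-x).
# r = 4 is the classic chaotic regime; chaos here comes precisely from a
# variable's value feeding back into its own next value.

def trajectory(x0, r=4.0, steps=50):
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs

a = trajectory(0.2)
b = trajectory(0.2 + 1e-7)   # perturb the initial condition slightly
max_gap = max(abs(x - y) for x, y in zip(a, b))
# A 1e-7 perturbation grows to order-1 divergence within a few dozen steps.
assert max_gap > 0.1
```

Unrolling the map over time gives an acyclic graph, but only an unbounded one; any fixed finite DAG corresponds to a fixed number of composition steps, which is the tension I'm pointing at.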
As I understand it, the proof in the appendix only assumes we're working with Bayes nets (i.e. just factorizations of probability distributions). That is, the graphs are not assumed to be causal in nature (they're not necessarily the causal diagrams of SCMs), although of course the arguments still port over if we make that stronger assumption.
Is that correct?
Based on playing around recently with a similar setup (but only toy examples), I'm actually surprised you get only 85%, as I've only observed NDE=0 when I freeze the entire reasoning_trace.
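(For reference, and assuming the comment uses the standard Pearl mediation notation: with treatment $A$, mediator $M$ (here the reasoning trace), and outcome $Y$, the natural direct effect is

$$\mathrm{NDE} = \mathbb{E}\big[Y(1, M(0))\big] - \mathbb{E}\big[Y(0, M(0))\big],$$

so $\mathrm{NDE} = 0$ means the treatment has no effect on the outcome once the mediator is held at its natural untreated value, i.e. everything flows through the trace.)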
My just-so ex...