Comparing AI Safety-Capabilities Dilemmas to Jervis' Cooperation Under the Security Dilemma
I've been skimming some things about the Security Dilemma (specifically Offense-Defense Theory) while looking for analogies for strategic dilemmas in the AI landscape.
I want to describe a simple comparison here, lightly held (and only lightly studied).
I largely agree with the above, but commenting with my own version.
What I think companies with AI services should do:
Can be done in under a week:
Weakly positive on this one overall. I like Coase's theory of the firm, and I like making analogies from it to other things. This application didn't quite work for me, though, and I'm trying to write up why.
One thing that feels off to me is an incomplete understanding of the Coase paper. What I think the article gets correct: Coase looks at the difference between markets (economists' preferred efficient mechanism) and firms/corporations, and observes that transaction costs (for people these would be contracts, but in general all tr...
This post was personally meaningful to me, and I'll try to cover that in my review while still analyzing it in the context of lesswrong articles.
I don't have much to add about the 'history of rationality' or the description of interactions of specific people.
Most of my value from this post wasn't directly from the content, but from how the content connected to things outside of rationality and lesswrong. So, basically, I loved the citations.
Lesswrong is very dense in self-links and self-citations, and to a lesser degree does still have a good number of li...
I read this sequence and then went through the whole thing. Without this sequence I'd probably still be procrastinating / putting it off. I think everything else I could write in review is less important than how directly this impacted me.
Still, a review: (of the whole sequence, not just this post)
First off, it signposts well what it is and who it's for. I really appreciate when posts do that, and this clearly gives the top-level focus and what's in/out.
This sequence is "How to do a thing" - a pretty big thing, with a lot of steps and bran...
Summary
Definitions
Thoughts, mostly on an alternative set of next experiments:
I find interpolations of effects to be a more intuitive way to study treatment effects, especially if I can modulate the treatment down to zero in a way that smoothly and predictably approaches the null case. It's not exactly clear to me what the "nothing going on" case is, but here are some possible experiments to interpolate between it and your treatment case.
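As a toy illustration of what I mean (with a hypothetical `run_experiment` standing in for your real setup, so everything named here is an assumption): sweep the treatment strength down toward zero and check that the measured effect smoothly approaches the null case.

```python
import numpy as np

def run_experiment(alpha: float, seed: int = 0) -> float:
    """Hypothetical stand-in for the real measurement at treatment strength alpha."""
    rng = np.random.default_rng(seed)
    return 0.5 * alpha + rng.normal(scale=0.01)  # fake effect plus noise

# alpha = 0.0 is the "nothing going on" / null case; 1.0 is the full treatment.
baseline = run_experiment(0.0)
for alpha in [0.0, 0.25, 0.5, 0.75, 1.0]:
    effect = run_experiment(alpha) - baseline
    print(f"alpha={alpha:.2f}  effect vs null = {effect:+.3f}")
```

If the effect doesn't shrink smoothly as the treatment goes to zero, that's a hint the measured phenomenon isn't coming from the treatment itself.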
I think at this point these feel like empirical questions, which I think would be much more clearly answered by demonstrations or experiments.
Trying to encode an additional penalty on changing non-semantic information is an interesting idea.
However, I think you're missing that you don't have the ability to directly compare to a reference LM in cases where you're training to improve on some performance benchmark. During training the model will change its predictions on everything to some degree (both semantic and nonsemantic content).
So your proposed d...
If what you’re saying is “any change to the distribution will change KL” — I think that’s just correct.
This also applies to changes during training where the model is learning to perform better on the objective task.
So we are expecting some amount of KL divergence already.
My claims are:
The cheapest places to hide information (due to KL) are places where the model already has high entropy (i.e., it is uncertain between many possible outputs)
optimization pressure will try to push this extra information into the cheapest places to hide
the increase in KL won’t be clearly distinguishable from the increase due to increased performance on the task
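To make the first claim concrete, here's a minimal toy sketch (my own numbers, nothing measured from a real model): shifting the same amount of probability mass costs roughly an order of magnitude less KL against the reference where the model is already uncertain than where it is confident.

```python
import numpy as np

def kl(q, p):
    """KL(q || p) in nats, i.e. the penalty for moving from reference p to q."""
    return float(np.sum(q * np.log(q / p)))

def shift_mass(p, src, dst, eps):
    """Move eps probability mass from token src to token dst (one 'hidden' signal)."""
    q = p.copy()
    q[src] -= eps
    q[dst] += eps
    return q

confident = np.array([0.97, 0.01, 0.01, 0.01])   # low-entropy next-token distribution
uncertain = np.array([0.25, 0.25, 0.25, 0.25])   # high-entropy next-token distribution
eps = 0.005

print(kl(shift_mass(confident, 0, 1, eps), confident))  # ~1e-3 nats
print(kl(shift_mass(uncertain, 0, 1, eps), uncertain))  # ~1e-4 nats, roughly 10x cheaper
```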
I think I understand what you're saying, but I want to double check and try laying it out explicitly.
I think I agree with both of these points, but here's my thinking for why I still expect to see this phenomenon (and why the article above was simplified to just be "human")
I think this is worth considering even in cases of things like recursive prompting for a couple of reasons that I haven't organized:
Putting what I think is the most important part of my reply first: I think research into mitigations is premature and instead demonstrating/measuring the phenomenon should take priority.
However, given that, I think I agree that these are all possible mitigations for the phenomenon, in particular (rephrasing your points):
Agree that founders are a bit of an exception. That's actually a bit I cover in the longer version of this when I talk about it in person.
Basically: "The only people who at the very top of large tech companies are either founders or those who were able to climb to the tops of moral mazes".
So my strategic corollary to this is that it's probably weakly better for AI alignment for founders to be in charge of companies longer, and to get replaced less often.
In the case of Facebook, even in the face of all of their history of actions, I think on the margin I'd pr...
I think there should be a norm about adding the big-bench canary string to any document describing AI evaluations in detail, where you wouldn't want it to be inside that AI's training data.
Maybe in the future we'll have a better tag for "don't train on me", but for now the big-bench canary string is the best we have.
This is in addition to things like "maybe don't post it to the public internet" or "maybe don't link to it from public posts" or other ways of ensuring it doesn't end up in training corpora.
I think this is a situation for defense-in-depth.
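As one concrete layer of that defense-in-depth on the corpus side, here's a minimal sketch of a filter a training pipeline could run. The `CANARY_STRINGS` values below are placeholders of my own, and the function name is hypothetical; a real pipeline would substitute the actual BIG-bench canary GUID (plus any other agreed-upon markers).

```python
from typing import Iterable, Iterator

# Placeholder markers only -- substitute the real BIG-bench canary GUID here.
CANARY_STRINGS = [
    "EXAMPLE-CANARY-GUID-00000000-0000-0000-0000-000000000000",
    "EXAMPLE DO-NOT-TRAIN MARKER",
]

def filter_canaried_documents(docs: Iterable[str]) -> Iterator[str]:
    """Yield only documents that contain none of the known canary markers."""
    for doc in docs:
        if not any(canary in doc for canary in CANARY_STRINGS):
            yield doc

# Usage: corpus = list(filter_canaried_documents(raw_scraped_documents))
```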
More Ideas or More Consensus?
I think one aspect you can examine about a scientific field is the "spread"-ness of its ideas and resources.
High-energy particle physics is an interesting extremum here -- there's broad agreement in the field about building higher-energy accelerators, and this means there can be lots of consensus about supporting a shared collaborative high-energy accelerator.
I think a feature of mature scientific fields is that "more consensus" can unlock more progress. Perhaps if there had been more consensus, the otherwise ill-fated supercond...
AGI will probably be deployed by a Moral Maze
Moral Mazes is my favorite management book ever, because instead of "how to be a good manager" it's about "empirical observations of large-scale organizational dynamics involving management".
I wish someone would write an updated version -- a lot has changed (though a lot has stayed the same) since the research for the book was done in the early 1980s.
My take (and the author's take) is that any company of nontrivial size begins to take on the characteristics of a moral maze. It seems to be a pretty good nul...
(Caveat: I ran the first big code scrape and worked on the code-generating models which later became Codex.)
My one line response: I think opt-out is obviously useful and good and should happen.
AFAIK there are various orgs/bodies working on this, but I'm kinda blanking on what/where. (In particular there's a FOSS mailing list that's been discussing how ML training relates to FOSS license rights that seems relevant)
Opt-out strings exist today, in an insufficient form. The most well-known and well-respected one is probably the big-bench canary string: htt...
Sometimes I get asked by intelligent people I trust in other fields, "what's up with AI x-risk?" -- and I think at least part of it unpacks to this: Why don't more people believe in / take seriously AI x-risk?
I think that is actually a pretty reasonable question. I think two follow-ups are worthwhile and I don't know of good citations / don't know if they exist:
Thanks so much for making this!
I'm hopeful this sort of dataset will grow over time as new sources come about.
In particular, I'd nominate adding MLSN (https://www.alignmentforum.org/posts/R39tGLeETfCZJ4FoE/mlsn-4-many-new-interpretability-papers-virtual-logit) to the list of newsletters in the future.
This seems like an overly alarmist take on what is a pretty old trend of research. Six years ago there were a number of universities working on similar models for the VizDoom competition (IIRC they were won by Intel and Facebook). It seems good to track this kind of research, but IMO the conclusions here are not supported at all by the evidence presented.
Longtermist X-Risk Cases for working in Semiconductor Manufacturing
Two separate pitches for jobs/roles in semiconductor manufacturing for people who are primarily interested in x-risk reduction.
Securing Semiconductor Supply Chains
This is basically the "computer security for x-risk reduction" argument applied to semiconductor manufacturing.
Briefly restating: it seems exceedingly likely that technologies crucial to x-risks are on computers or connected to computers. Improving computer security increases the likelihood that those machines are not stolen...
I think your explanation of legibility here is basically what I have in mind, excepting that if it's human-designed it's potentially not all-encompassing. (For example, a world model that knows very little, but knows how to search for information in a library)
I think interpretability is usually a bit more narrow, and refers to developing an understanding of an illegible system. My take is that it is not "interpretability" to understand a legible system, but maybe I'm using the term differently than others here. This is why I don't think "...
Two Graphs for why Agent Foundations is Important (according to me)
Epistemic Signpost: These are high-level abstract reasons, and I don’t go into precise detail or gears-level models. The lack of rigor is why I’m shortform-ing this.
First Graph: Agent Foundations as Aligned P2B Fixpoint
P2B (a recursive acronym for Plan to P2B Better) is a framing of agency as a recursively self-reinforcing process. It resembles an abstracted version of recursive self-improvement, which also incorporates recursive empowering and recursive resource gathering. ...
Maybe useful: an analogy this post brought to mind for me: Replacing “AI” with “Animals”.
Hypothetical alien civilization, observing Early Earth and commenting on whether it poses a risk.
Doesn’t nature’s optimization produce non-agentic animals? It mostly does, but those aren’t the ones we’re concerned with. The risk is all concentrated in the agentic animals.
Basically every animal ever is not agentic. I’ve studied animals for my entire career and I haven’t found an agentic animal yet. That doesn’t preclude them showing up in the futur...
Hacking the Transformer Prior
Neural Network Priors
I spend a bunch of time thinking about the alignment of the neural network prior for various architectures of neural networks that we expect to see in the future.
Whatever alignment failures are highly likely under the neural network prior are probably worth a lot of research attention.
Separately, it would be good to figure out knobs/levers for changing the prior distribution to be more aligned (or produce more aligned models). This includes producing more interpretable models.
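One cheap way to look at that prior directly (a minimal sketch assuming the HuggingFace transformers API; the specific model class, prompt, and sampling settings are my own choices, not from the post): sample text from a randomly initialized, untrained transformer and see what the architecture plus initialization make likely before any data is seen.

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")  # only the tokenizer is reused
model = GPT2LMHeadModel(GPT2Config())              # fresh random weights, no pretraining
model.eval()

inputs = tokenizer("The", return_tensors="pt")
with torch.no_grad():
    sample = model.generate(**inputs, do_sample=True, max_new_tokens=20,
                            pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(sample[0]))  # a draw from the (untrained) transformer prior
```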
Analogy to Software Devel...
I think there’s a lot going on with your conflating the speed prior over circuits with a speed prior over programs.
I think a lot of the ideas in this direction are either confused by the difference between circuit priors and program priors, or at least treat them as equivalent. Unfortunately a lot of this is vague until you start specifying the domain of the model. I think specifying this more clearly will help in communicating about these ideas. To start with this myself, when I talk about circuit induction, I’m talking about things th...
Interpretability Challenges
Inspired by a friend I've been thinking about how to launch/run interpretability competitions, and what the costs/benefits would be.
I like this idea a lot because it cuts directly at one of the hard problems of spinning up in interpretability research as a new person. The field is difficult and the objectives are vaguely defined; it's easy to accidentally trick yourself into seeing signal in noise, and there's never certainty that the thing you're looking for is actually there.
On the other hand, most of the interpretability...
My Cyberwarfare Concerns: A disorganized and incomplete list
I wish more of the language alignment research folks were looking into how current proposals for aligning transformers end up working on S4 models.
(I am one of said folks so maybe hypocritical to not work on it)
In particular it seems like there are ways in which they would be more interpretable than transformers:
I work on this sort of thing at OpenAI.
I think alignment datasets are a very useful part of a portfolio approach to alignment research. Right now I think there are alignment risks/concerns for which datasets like this wouldn't help, but also there are some that it would help for.
Datasets and benchmarks more broadly are useful for forecasting progress, but this assumes smooth/continuous progress (in general a good assumption -- but also good to be wary of cases where this isn't the case).
Some thoughts from working on generating datasets for research, ...
I worry a little bit about this turning into techniques which let you hide circuits in neural networks. These "hiding techniques" are a riposte to techniques based on modularity or clusterability -- techniques that explore naturally emergent patterns.[1] In a world where we rely on alignment techniques that assume internal circuitry is naturally modular, trojan-horse networks could evade them.
I expect this to happen by default for a bunch of reasons. An easy one to point to is the "free software" + "crypto anarchist" + "fuck y...
Just copy-pasting the section
...We believe that Transformative Artificial Intelligence (TAI) [Karnofsky et al., 2016] is approaching [Cotra, 2020, Grace et al., 2018], and that these systems will cause catastrophic damage if they are misaligned with human values [Fox and Shulman, 2013, Omohundro, 2008]. As such, we believe it is essential to prioritize and help facilitate technical research that ensures TAI’s values will be aligned with ours.
AI Alignment generally refers to the problem of how to ensure increasingly powerful and autonomous AI systems per
It's worth probably going through the current deep learning theories that propose parts of gears-level models, and see how they fit with this. The first one that comes to mind is the Lottery Ticket Hypothesis. It seems intuitive to me that certain tasks correspond to some "tickets" that are harder to find.
I like the taxonomy in Viering and Loog, and it links to a bunch of other interesting approaches.
This paper shows phase transitions in data quality as opposed to data size, which is an angle I hadn't considered before.
There's the google pa...
Decomposing Negotiating Value Alignment between multiple agents
Let's say we want two agents to come to agreement on living with each other. This seems pretty complex to specify; they agree to take each other's values into account (somewhat), not destroy each other (with some level of confidence), etc.
Neither initially has total dominance over the other. (This implies that neither is corrigible to the other)
A good first step for these agents is to share each's values with the other. While this could be intractably complex -- it's probably ...
I'm really excited about this research direction. It seems so well-fit to what you've been researching in the past -- so much so that it doesn't seem to be a new research direction so much as a clarification of the direction you were already pursuing.
I think producing a mostly-coherent and somewhat-nuanced generalized theory of alignment would be incredibly valuable to me (and I would consider myself someone working on prosaic alignment strategies).
A common thread in the last year of my work on alignment is something like "How can I be an aligned intellig...
Should this other post be a separate linkpost for this? https://www.furidamu.org/blog/2022/02/02/competitive-programming-with-alphacode/#fnref:2
Feels like it covers the same material, but as a personal description by an author rather than the DeepMind presser.
I think you're right that upgraded verification by itself is insufficient for 'defense wins' worlds. I guess I'd thought that was apparent, but you're right that it's definitely worth saying explicitly.
A big wish of mine is that we end up doing more planning/thinking-things-through for how researchers working on AI today could contribute to 'defense wins' progress.
My implicit other take here that wasn't said out loud is that I don't really know of other pathways where good theorem proving translates to better AI x-risk outcomes. I'd be eager to know of these.
Thoughts:
First, it seems worthwhile to try taboo-ing the word 'deception' and see whether the process of building precision to re-define it clears up some of the confusion. In particular, it seems like there's some implicit theory-of-mind stuff going on in the post and in some of the comments. I'm interested if you think the concept of 'deception' in this post only holds when there is implicit theory-of-mind going on, or otherwise.
As a thought experiment for a non-theory-of-mind example, let's say the daemon doesn't really understand why it get...
Copying some brief thoughts on what I think about working on automated theorem proving relating to working on aligned AGI:
FWIW I think this is basically right in pointing out that there are a bunch of errors in reasoning when people claim a large deep neural network "knows" something or that it "doesn't know" something.
I think this exhibits another issue, though: by strongly changing the contextual prefix, you've confounded it in a bunch of ways that are worth explicitly pointing out:
Quite a lot of scams involve money that is fake. This seems like another reasonable conclusion.
Like, every time I simulate myself in this sort of experience, almost all of the prior is dominated by "you're lying".
I have spent an unreasonable (and yet unsuccessful) amount of time trying to sketch out how to present omega-like simulations to my friends.
Giving Newcomb's Problem to Infosec Nerds
Newcomb-like problems are pretty common thought experiments here, but I haven't seen written up a bunch of my favorite reactions, which I've gotten when discussing it in person. Here's a disorganized collection:
Adding a comment instead of another top-level post saying basically the same thing. Adding my thoughts on things I liked about this plan:
It's centered on people. A lot of rationality is thinking and deciding and weighing and valuing possible actions. Another frame that is occasionally good (for me at least) is "How would <my hero> act?" -- and this can help guide my actions. It's nice to have a human or historical action to think about instead of just a vague virtue or principle.
It encourages looking through history for events o...
Ideas that I and others have had, off the top of my head:
I like pointing out this confusion. Here's a grab-bag of some of the things I use it for, to try to pull them apart:
- actors/institutions far away from the compute frontier produce breakthroughs in AI/AGI tech (juxtaposing "only the top 100 labs" vs "a couple hackers in a garage")
- once a sufficient AI/AGI capability is reached, it will be quickly optimized to use much less compute
- amount of "optimization pressure" (in terms of research effort) pursuing AI/AGI tech, and the likelihood that they missed low-hanging fruit
- how far AI/AGI research/products
...