Comparing AI Safety-Capabilities Dilemmas to Jervis' Cooperation Under the Security Dilemma
I've been skimming some things about the Security Dilemma (specifically Offense-Defense Theory) while looking for analogies for strategic dilemmas in the AI landscape.
I want to describe a simple comparison here, lightly held (and only lightly studied).
Now, recreating the two values of each of the two variables of Offense-Defense Theory:
Finally, we can sketch out the Four Worlds of Offense-Defense Theory
From the paper linked in the title:
My Overall Take
This one is weird, and it's hard for me to tell convincing stories about the real version of this, as opposed to seeming to have a Safety posture -- things like "recruiting", "public support", "partnerships", etc. can all come from merely seeming to adopt the Safety posture. (Though this is actually a feature of the security dilemma and offense-defense theory, too.)
I largely agree with the above, but am commenting with my own version.
What I think companies with AI services should do:
Can be done in under a week:
Longer term
References:
Weakly positive on this one overall. I like Coase's theory of the firm, and I like making analogies from it to other things. This application didn't quite work for me, though, and I'll try to write up why.
One thing that I think feels off is an incomplete understanding of the Coase paper.

What I think the article gets correct: Coase looks at the difference between markets (economists' preferred efficient mechanism) and firms/corporations, and observes that transaction costs (for people these would be contracts, but in general all transaction costs are included) are avoided inside firms.

What I think it misses: a primary question explored in the paper is what factors govern the size of firms, and this leads to a mechanistic model in which the transaction costs internal to the firm increase with the firm's size until they match the transaction costs of the open market (which gives the expected maximum efficient size of a non-monopoly firm). A second, smaller missed point, I think, is that the price mechanism works for transactions outside the firm but not for transactions inside it.
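To make that size mechanism explicit, here's a rough formalization in my own notation (not notation from the paper): the firm keeps absorbing transactions as long as organizing one more internally is cheaper than going through the market, so the efficient size $n^*$ satisfies

$$\frac{d\,C_{\text{internal}}(n)}{dn}\bigg\rvert_{n=n^*} = c_{\text{market}},$$

i.e. the firm stops growing once the marginal cost of organizing an additional transaction internally rises to meet the per-transaction cost of using the market.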
Given these, I think the metaphor presented here seems incomplete. It's drawing connections to some of the parts of the paper, but not all of the central parts, and not enough to connect to the central question of size.
I'm confused exactly what parts of the metaphor map to the paper's concepts of market and firm. Is monogamy the market, since it doesn't require high-order coordination? Is polyamory the market, since everyone can be a free-ish actor in an unbundled way? Is monogamy the firm, since it's not using price-like mechanisms to negotiate individual unbundled goods? Is polyamory the firm, since it's subject to the transaction-cost scaling limit on size?
I do think it uses the 'transaction costs matter' idea from the paper pretty solidly, so there is that bit.
I don't really have much I can say about the polyamory bits outside of the economics bits.
This post was personally meaningful to me, and I'll try to cover that in my review while still analyzing it in the context of lesswrong articles.
I don't have much to add about the 'history of rationality' or the description of interactions of specific people.
Most of my value from this post wasn't directly from the content, but from how the content connected to things outside of rationality and lesswrong. So, basically, I loved the citations.
Lesswrong is very dense in self-links and self-citations, and to a lesser degree it still has a good number of links to other websites.
However, it has a dearth of connections to things that aren't blog posts -- books, essays from before the internet, etc. Especially older writings.
I found this post's citation section to be a treasure trove of things I might not have found otherwise.
I have picked up and skimmed/started at least a dozen of the books on the list.
I still come back to this list sometimes when I'm looking for older books to read.
I really want more things like this on lesswrong.
I read this sequence and then went through the whole thing. Without this sequence I'd probably still be procrastinating / putting it off. I think everything else I could write in review is less important than how directly this impacted me.
Still, a review (of the whole sequence, not just this post):
First off, it signposts well what it is and who it's for. I really appreciate when posts do that, and this one clearly gives the top-level focus and what's in/out.
This sequence is "How to do a thing" - a pretty big thing, with a lot of steps and branches, but a single thing with a clear goal.
The post addresses a real need in the community (and it was a personal need for me as well) -- which I think makes for the best kind of "how to do a thing" post.
It was detailed and informative while still keeping the individual points brief and organized.
It specifically calls out decision points and options, how much they matter, what the choices are, and information relevant to choosing. This is a huge energy-saver in terms of actually getting people to do this process.
When I went through it, it was accurate, and I ran into the decision points and choices as expected.
Extra appreciation for the first post which also includes a concrete call to action for a smaller/achievable-right-now thing for people to do (sign a declaration of intent to be cryopreserved). Which I did! I also think that a "thing you can do right now" is a great feature to have in "how to do a thing" posts.
I'm in the USA, so I don't have much evaluation or feedback on how valuable this is to non-USA folks. I really do appreciate that a bunch of extra information was added for non-USA cases, and it's organized such that it's easy to read/skim past if not needed.
I know that this caused me personally to sign up for cryonics, and I hope it did for others as well. Inasmuch as the author's goal was for more people in our community to sign up for cryonics -- I think that's a great goal and I think they succeeded.
Summary
Definitions
Key ideas
Review Summary
Review
Overall I think I agree with the observations and concepts presented, as well as the sense of frustration/exasperation at the way it seems we're collectively doing it wrong.
However, I think this piece feels incomplete in a number of ways, and I'll try to point out why by giving examples of things that would make it feel complete to me.
One thing that would make it feel complete to me is a better-organized set of definitions/taxonomy around the key ideas. I think 'politics' can be split into object-level things around politicians vs. policies. I think even the 'object level' can be split into things like actions (vote for person X or not) vs. modeling (what is the predicted impact on Y). When I try to do this kind of detail-generation, I find that my desire for object-level is actually a desire for specific kinds of object-level focus (and not object-level in the generic).
Another way of making things more precise would be to try to make some kind of measure or metric out of the meta<->object dimension. Questions like 'how would it be measured' or even 'what are the units of measurement' would be great for building intuitions and models around this. Relatedly, describing what 'critical meta-ness' or the correct balance point looks like would also be useful here. Assuming we had a way to decrease meta / increase object, how would we know when to stop? I think these are the sorts of things that would make for a more complete meta-object theory.
A gears-level model of what's going on would also make this feel complete to me. Here's me trying to come up with one on the spot:
Discourse around politics is a pretty charged and emotionally difficult thing for most people, in ways that can subvert our normally-functioning mechanisms of updating our beliefs. When we encounter object-level evidence that contradicts our beliefs, we feel a negative emotion/experience (in a quick flash). One palliative to this feeling is to "go meta" - hop up the ladder of abstraction to a place where the belief is not in danger. We habituate ourselves to it by seeing others do similarly, and imitation is enough to propagate this without anyone doing it intentionally. This model implicitly makes predictions about how to make spaces/contexts where more object-level discussions happen (less negative experience, more emotional safety), as well as what kinds of internal changes would facilitate more object-level discussions (train people to notice these fast emotional reactions and their corresponding mental moves).
Another thing that would make this article feel 'complete' to me would be to compare the 'politics' domain to other domains familiar to folks here on lesswrong (candidates are: effective altruism, AI alignment, rationality, etc). Is the other domain too meta in the way politics is? Is it too object level? It seems like the downsides (insufficient exploration, self-censorship, distraction) could apply to a much bigger domain of thought.
Thoughts, mostly on an alternative set of next experiments:
I find interpolations of effects to be a more intuitive way to study treatment effects, especially if I can modulate the treatment down to zero in a way that smoothly and predictably approaches the null case. It's not exactly clear to me what the "nothing going on" case is here, but here are some possible experiments to interpolate between it and your treatment case.
I mostly say all this because I think it's hard to evaluate "something is up" (predictions don't match empirical results) claims in ML that rest on single experiments or A/B tests. It's too easy (IMO) to get bugs/etc. Smoothly interpolating effects -- with one side a well-established null/prior case and the other the different case, varying smoothly with the treatment -- is IMO strong evidence that "something is up".
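To make the shape of that concrete, here's a minimal sketch of the interpolation idea (my own toy construction; `run_experiment` is a hypothetical stand-in for whatever training/eval loop is actually being run):

```python
import numpy as np

def run_experiment(treatment_strength: float, seed: int) -> float:
    """Hypothetical stand-in: return the metric of interest for a given
    treatment strength. Replace with the real training/evaluation run."""
    rng = np.random.default_rng(seed)
    # Toy simulated effect: grows smoothly with treatment strength, plus noise.
    return 0.5 * treatment_strength + rng.normal(scale=0.02)

# Establish the null / "nothing going on" baseline at treatment = 0.
null_metric = np.mean([run_experiment(0.0, seed=s) for s in range(5)])

# Sweep the treatment smoothly from the null case up to the full treatment.
for strength in np.linspace(0.0, 1.0, 11):
    runs = [run_experiment(strength, seed=s) for s in range(5)]
    print(f"treatment={strength:.1f}  effect vs null={np.mean(runs) - null_metric:+.3f}")

# If the effect curve is smooth and passes through ~0 at treatment=0, that's
# much stronger evidence that "something is up" than a single A/B comparison.
```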
Hope there's something in those that's interesting and/or useful. If you haven't already, I strongly recommend checking out the intrinsic dimensionality paper -- you might get some mileage by swapping your cutoff point for their phase change measurement point.
I think at this point these feel like empirical questions, which I think would be much more clearly answered by demonstrations or experiments.
Trying to encode an additional penalty on changing non-semantic information is an interesting idea.
However, I think you're missing that you don't have the ability to directly compare to a reference LM in cases where you're training to improve on some performance benchmark. During training, the model will change its predictions on everything to some degree (both semantic and non-semantic content).
So your proposed detection will always show some amount of disagreement between the prior and the trained model, on weird grammatical patterns as well as on conceptual tokens. The question is: "Is the difference merely due to the changes that improve performance, or is it also transmitting hidden information?"
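For concreteness, here's a rough sketch of the comparison I'm picturing (assuming PyTorch and HuggingFace transformers; the model names and sample text are placeholders, not your actual setup):

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholders: in practice these would be the frozen reference model and the
# model fine-tuned on the performance benchmark.
ref_name, tuned_name = "gpt2", "gpt2"
tokenizer = AutoTokenizer.from_pretrained(ref_name)
ref_model = AutoModelForCausalLM.from_pretrained(ref_name).eval()
tuned_model = AutoModelForCausalLM.from_pretrained(tuned_name).eval()

text = "Example text sampled from the trained model."  # placeholder
ids = tokenizer(text, return_tensors="pt").input_ids

with torch.no_grad():
    ref_logp = F.log_softmax(ref_model(ids).logits, dim=-1)    # [1, seq, vocab]
    tuned_logp = F.log_softmax(tuned_model(ids).logits, dim=-1)

# Per-position KL(tuned || ref): how far the trained model has drifted.
per_token_kl = (tuned_logp.exp() * (tuned_logp - ref_logp)).sum(-1).squeeze(0)

for tok, kl in zip(tokenizer.convert_ids_to_tokens(ids[0].tolist()), per_token_kl.tolist()):
    print(f"{tok!r:>15}  KL={kl:.4f}")

# After any fine-tuning this is nonzero at every position, so elevated KL by
# itself doesn't distinguish "got better at the task" from "also carrying
# hidden information".
```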
If what you’re saying is “any change to the distribution will change KL” — I think that’s just correct.
This also applies to changes during training where the model is learning to perform better on the objective task.
So we are expecting some amount of KL divergence already.
My claims are:
The cheapest places to hide information (in terms of KL) are places where the model already has high entropy (i.e. it is uncertain between many possible outputs)
Optimization pressure will try to push this extra information into these cheapest places to hide
The increase in KL won't be clearly distinguishable from the increase due to improved performance on the task
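To illustrate the first claim, here's a toy numerical example (entirely my own construction): nudging the output distribution by the same amount to encode a bit costs far less KL at a position where the model is uncertain than at one where it is confident.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for discrete distributions given as arrays."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

def encode_bit(q, bit, strength=0.4):
    """Skew the distribution toward token `bit` to carry one hidden bit."""
    p = q.copy()
    p[bit] += strength * (1.0 - p[bit])
    return p / p.sum()

high_entropy = np.array([0.5, 0.5])    # model torn between two continuations
low_entropy = np.array([0.99, 0.01])   # model nearly certain of token 0

for name, q in [("high-entropy position", high_entropy),
                ("low-entropy position", low_entropy)]:
    p = encode_bit(q, bit=1)
    print(f"{name}: KL(encoded || original) = {kl(p, q):.4f}")

# The same nudge toward the less-likely token is roughly 50x more expensive in
# KL at the confident position, so optimization pressure favors hiding
# information where the model already has high entropy.
```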
I like pointing out this confusion. Here's a grab-bag of some of the things I use it for, to try to pull them apart:
Probably also others I'm forgetting.