The hard/useful parts of alignment research are largely about understanding agency/intelligence/etc. That sort of understanding naturally yields capabilities-relevant insights. So, alignment researchers naturally run into decisions about how private to keep their work.
This post is a bunch of models which I use to think about that decision.
I am not very confident that my thinking on the matter is very good; in general I do not much trust my own judgment on security matters. I’d be more-than-usually interested to hear others’ thoughts/critiques.
The “Nobody Cares” Model
By default, nobody cares. Memetic reproduction rate is less than 1. Median number of citations (not counting self-citations) is approximately zero, and most citations are from people who didn’t actually read the whole paper but just noticed that it’s vaguely related to their own work. Median number of people who will actually go to any effort whatsoever to use your thing is zero. Getting other people to notice your work at all takes significant effort and is hard even when the work is pretty good. “Nobody cares” is a very strong default, the very large majority of the time.
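One way to make "memetic reproduction rate is less than 1" concrete is a toy branching process. This is a sketch under my own assumptions, not a model fitted to any data: each reader passes the work on to a random number of new readers with mean `r`. The function names, the two-contacts-per-reader setup, and the parameter values are all hypothetical. When `r < 1` the cascade dies out, and the expected total audience from one initial reader is `1 / (1 - r)`.

```python
import random


def expected_total_reach(r: float) -> float:
    """Expected total readers from one seed reader, for mean reproduction rate r < 1."""
    if not 0 <= r < 1:
        raise ValueError("r must be in [0, 1) for the cascade to die out")
    return 1.0 / (1.0 - r)


def simulate_cascade(r: float, rng: random.Random, max_size: int = 10_000) -> int:
    """Simulate one cascade and return the total number of readers.

    Assumption (hypothetical): every reader has 2 contacts and passes the
    work to each of them independently with probability r / 2, so the mean
    number of new readers per reader is r.
    """
    total = 1      # the seed reader
    frontier = 1   # readers who have not yet had a chance to pass it on
    while frontier and total < max_size:
        new = sum(1 for _ in range(frontier * 2) if rng.random() < r / 2)
        total += new
        frontier = new
    return total


if __name__ == "__main__":
    rng = random.Random(0)
    trials = 20_000
    mean = sum(simulate_cascade(0.5, rng) for _ in range(trials)) / trials
    print("closed form:", expected_total_reach(0.5))
    print("simulated mean:", mean)
```

Even at `r = 0.5`, which is probably generous for most theoretical work, the expected audience in this toy model is about two people including the first reader.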
Privacy, under this model, is very easy. You need to make a very large active effort in order for your research to not end up de-facto limited to you and maybe a few friends/coworkers.
This is the de-facto mechanism by which most theoretical work on alignment avoids advancing capabilities most of the time, and I think it is the main mechanism by which most theoretical work on alignment should aim to avoid advancing capabilities most of the time. It should be the default. But obviously there will be exceptions; when does the “nobody cares” model fail?
Theory-Practice Gap and Flashy Demos
Why, as a general rule, does nobody care? In particular, why does nobody working on AI capabilities care about most work on alignment theory most of the time, given that a lot of it is capabilities-relevant?
Well, even ignoring the (large) chunk of theoretical research which turns out to be useless, the theory-practice gap is a thing. Most theoretical ideas don’t really do much when you translate them to practice. This includes most ideas which sound good to intelligent people. Even those theoretical ideas which do turn out to be useful are typically quite hard to translate into practice. It takes months of work (at least), often additional significant insights, and often additional enabling pieces which aren’t already mainstream or even extant. Practitioners correctly expect this, and therefore mostly don’t pay attention to most ideas until after there’s evidence that they work in practice. (This is especially true of the sort of people who work on ML systems.)
In ML/AI, smart-sounding ideas which don’t really work easily are especially abundant, so ML practitioners are (correctly) even more than usually likely to ignore theoretical work.
The flip side of this model is that people will pay lots of attention once there is clear evidence that some idea works in practice - i.e. evidence that the idea has crossed the theory-practice gap. What does that look like? Flashy demos. Flashy demos are the main signal that the theory-practice gap has already been crossed, which people correctly take to mean that the thing can be useful now.
The theory-practice gap is therefore a defense which both (a) slows down someone actively trying to apply an idea, and (b) makes most ideas very-low-memetic-fitness until they have a flashy demo. To a large extent, one can write freely in public without any flashy demos, and it won’t spread very far memetically (or will spread very slowly if it does).
Reputation
Aside from flashy demos, the other main factor I know of which can draw people’s attention is reputation. If someone has a track record of interesting work, high status, or previous flashy demos, then people are more likely to pay attention to their theoretical ideas even before the theory-practice gap is crossed.
Of course this is not relevant to the large majority of people the large majority of the time, especially insofar as it involves reputation outside of the alignment research community. That said, if you’re relying on lack-of-reputation for privacy, then you need to avoid gaining too broad a following in the future, which may be an important constraint - more on that in the next section.
Takeaways & Gotchas
Main takeaway of the “nobody cares” model: if you’re not already a person of broad interest outside alignment, and you don’t make any flashy demos, then probably approximately nobody working on ML systems outside of alignment will pay any attention to your work.
… but there are some gotchas.
First, there’s a commitment/time-consistency problem: to the extent that we rely on this model of privacy, we need to precommit to remain uninteresting in the future, at least until we’re confident that our earlier work won’t dangerously accelerate capabilities. If you’re hoping to gain lots of status outside the alignment research community, that won’t play well with a “nobody cares” privacy model. If you’re hoping to show future flashy demos, that won’t play well with a “nobody cares” privacy model. If your future work is very visibly interesting, you may be stuck keeping it secret.
(Though note that, in the vast majority of cases, it will turn out that your earlier theory work was never particularly important for capabilities in the first place, and hopefully you figure that out later. So relying on “nobody caring” now will reduce your later options mainly in worlds where your current work turns out to be unusually important/interesting in its own right.)
Second, relying on “nobody caring” obviously does not yield much defense-in-depth. It’s probably not something we want to rely on for stuff that immediately or directly advances capabilities by a lot.
But for most theoretical alignment work most of the time, where there are some capabilities implications but they’re not very direct or immediately dangerous on their own, I think “nobody cares” is the right privacy model under which to operate. Mostly, theoretical researchers should just not worry much about privacy, as long as (1) they don’t publish flashy demos, (2) they don’t have much name recognition outside alignment, and (3) the things they’re working on won’t immediately or directly advance capabilities by a lot.
Beyond “Nobody Cares”: Danger, Secrecy and Adversaries
Broadly speaking, I see two main categories of reasons for theoretical researchers to go beyond the “nobody cares” model and start to actually think about privacy:
- Research which might directly or immediately advance capabilities significantly
- Current or anticipated future work which is unusually likely to draw a lot of attention, especially outside the alignment field
These are different failure modes of the “nobody cares” model, and they call for different responses.
The “Keep It To Yourself” Model for Immediately Capabilities-Relevant Research
Under the “nobody cares” model, a small number of people might occasionally pay attention to your research and try to use it, but your research is not memetically fit enough to spread much. For research which might directly or immediately advance capabilities significantly, even a handful of people trying it out is potentially problematic. That handful might realize there’s a big capability gain and run off to produce a flashy demo.
For research which is directly or immediately capabilities-relevant, we want zero people to publicly try it. The “nobody cares” model is not good enough to robustly achieve that. In these cases, my general policy would be to not publish the research, and possibly not share it with anyone else at all (depending on just how immediately and directly capabilities-relevant it looks).
On the other hand, we don’t necessarily need to be super paranoid about it. In this model, we’re still mostly worried about the research contributing marginally to capabilities; we don’t expect it to immediately produce a full-blown strong AGI. We want to avoid the work spreading publicly, but it’s still not that big a problem if e.g. some government surveillance sees my Google Docs. Spy agencies, after all, would presumably not publicly share my secrets after stealing them.
The “Active Adversary” Model
… which brings us to the really paranoid end of the spectrum. Under this model, we want to be secure even against active adversaries trying to gain access to our research - e.g. government spy agencies.
I’m not going to give advice about how to achieve this level of security, because I don’t think I’m very good at this kind of paranoia. The main question I’ll focus on is: when do we need highly paranoid levels of security, and when can we get away with less?
As with the other models, someone has to pay attention in order for security to be necessary at all. Even if a government spy agency had a world-class ML research lab (which I doubt is currently the case), they’d presumably ignore most research for the same reasons other ML researchers do. Also, spying is presumably expensive; random theorists/scientists are presumably not worth the cost of having a human examine their work. The sorts of things which I’d expect to draw attention are the same as earlier:
- enough of a track record that someone might actually go to the trouble of spying on our work
- public demonstration of impressive capabilities, or use of impressive capabilities in a way which will likely be noticed (e.g. stock trading)
Even if we are worried about attention from spies, that still doesn’t mean that most of our work needs high levels of paranoia. The sort of groups who are likely to steal information not meant to be public are not themselves very likely to make that information public. (Well, assuming our dry technical research doesn’t draw the attention of the dreaded Investigative Journalists.) So unless we’re worried that our research will accelerate capabilities to such a dramatic extent that it would enable some government agency to develop dangerous AGI themselves, we probably don’t need to worry about the spies.
The case where we need extreme paranoia is where both (1) an adversary is plausibly likely to pay attention, and (2) our research might allow for immediate and direct and very large capability gains, without any significant theory-practice gap.
This degree of secrecy should hopefully not be needed very often.
Other Considerations
Unilateralist’s Curse
Many people may have the same idea, and it only takes one of them to share it. If all their estimates of the riskiness of the idea have some noise in them, and their risk tolerance has some noise, then presumably it will be the person with unusually low risk estimate and unusually high risk tolerance who determines whether the idea is shared.
In general, this sort of thing creates a bias toward unilateral actions being taken even when most people want them to not be taken.
On the other hand, the unilateralist's curse is only relevant to an idea which many people have. And if many people have the idea already, then it's probably not something which can realistically stay secret for very long anyway.
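The noise argument above can be sketched numerically. This is a toy model under my own assumptions rather than anything measured: each of `n` people draws a noisy risk estimate and a noisy risk tolerance from Gaussians, and shares the idea if their estimate falls below their tolerance. All names and parameter values are hypothetical. The idea counts as shared if at least one person shares it.

```python
import random


def p_at_least_one(p_single: float, n: int) -> float:
    """Probability that at least one of n independent people shares."""
    return 1.0 - (1.0 - p_single) ** n


def simulate_share_rate(
    n: int,
    true_risk: float,
    mean_tolerance: float,
    estimate_sd: float,
    tolerance_sd: float,
    trials: int,
    rng: random.Random,
) -> float:
    """Fraction of trials in which at least one of n people shares the idea.

    Assumption (hypothetical): risk estimates and risk tolerances are
    independent Gaussians, and a person shares when estimate < tolerance.
    """
    shared = 0
    for _ in range(trials):
        for _ in range(n):
            estimate = rng.gauss(true_risk, estimate_sd)
            tolerance = rng.gauss(mean_tolerance, tolerance_sd)
            if estimate < tolerance:
                shared += 1
                break  # one sharer is enough
    return shared / trials


if __name__ == "__main__":
    rng = random.Random(0)
    for n in (1, 10, 50):
        rate = simulate_share_rate(n, 1.0, 0.0, 0.5, 0.5, 20_000, rng)
        print(f"{n} people: shared in {rate:.0%} of trials")
```

In this setup the typical person judges the idea too risky to share, yet the chance that somebody shares it climbs quickly with the number of people who have it, which is the bias the section describes.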
Existing Ideas
In general, if an idea has been talked-about in public at some previous point, then it’s probably fine to talk about again. Your marginal impact on memetic fitness is unlikely to be very large, and if the idea hasn’t already taken off then that’s strong evidence that it isn’t too memetically fit. (Though this does not apply if you are a person with a very large following.)
Alignment Researchers as the Threat
Just because someone’s part of the ingroup does not mean that they won’t push the run button. We don’t have a way to distinguish safe from dangerous programs; our ingroup is not meaningfully more able to do so than the outgroup, and few people in the ingroup are very careful about running python scripts on a day-to-day basis. (I’m certainly not!)
Point is: don’t just assume that it’s fine to share ideas with everyone in the ingroup.
On the other hand, if all we want is for an idea to not spread publicly, then in-group trust is less risky, because group members would burn their reputation by sharing private things.
Differential Alignment/Capabilities Advantage
In the large majority of cases, research is obviously much more relevant to either alignment or capabilities than to the other, and desired privacy levels should be chosen based on that.
I don’t think it’s very productive, in practice, to play the “but it could be relevant to [alignment/capabilities] via [XYZ]” game for things which seem obviously more relevant to capabilities/alignment.
Most Secrecy Is Hopefully Temporary
Most ideas will not dramatically impact capabilities. Usually, we should expect secrecy to be temporary, long enough to check whether a potentially-capabilities-relevant idea is actually short-term relevant (i.e. test it on some limited use-case).
Feedback Please!
Part of the reason I’m posting this is that I have not seen discussion of the topic which feels adequate. I don’t think my own thoughts are clearly correct. So, please argue about it!
Okay, I've thought about it more, and I think my concerns are mainly outlined by this. Less by the post's actual contents, and more by the post's existence.
People dislike villains. Whether the concerns Andrew outlines are valid or not, people on the outside will tend to think that such concerns are valid. The hypothetical unilateral-aligned-AGI organization will be, at all times, on the verge of being a target of the entire world. The public would rally against it if the organization's intentions became public knowledge, other AI Labs would be eager to get rid of the competition slash threat it presents, and governments would be eager either to seize AI research (if they take AI seriously by that point) or acquire political points by squishing something the public and megacorps want squished.
As such, the unilateral path requires a lot of subtle secrecy too. It should not be known that we expect our AI to engage in, uh, full-scale world... optimization. In theory, that connection can be left obscured — most of the people involved can just be allowed to fail to think about what the aligned superintelligence will do once it's deployed, so there aren't leaks from low-commitment people joining and quitting the org. But the people in charge will probably have the full picture, and... Well, at this point it sounds like the stupid kind of supervillain doomsday scheme, no?
More practically, I think the ship has already sailed on keeping the sort of secrecy this plan would need to work. I don't understand why all this talk of pivotal acts has been allowed to enter public discourse by Eliezer et al., but it'll be doubtlessly connected to any hypothetical future friendly-AGI org. Probably not by the public/other AI labs directly, but by fellow AI Safety researchers who do not agree with unilateral pivotal acts. And once the concerns have been signal-boosted so, then they may be picked up by the media/politicians/Eliezer's sneer club/whoever, and once we're spending billions on training runs and it's clear that there's something actually going on beyond a bunch of doom-cult wackos, they will take these concerns seriously and act on them.
A further contributing factor may be increased public awareness of AI Risk in the future, encouraged by general AI capabilities growth, possible (non-omnicidal) AI disasters, and poorly-considered efforts of our own community. (It would be very darkly ironic if AI Safety's efforts to ban dangerous AI research resulted in governments banning AI Safety's own AGI research and no-one else's, so that's probably an attractor in possibility-space because we live in Hell.)
The bottom line is... This idea seems thermonuclear, in the sense that trying it and getting noticed probably completely dooms us on the spot, and it'd be really hard not to get noticed.
(Though I don't really buy the whole "pivotal processes" thing either. We can probably increase the timeline this way, but actually making the world's default systems produce an aligned AI... Nah.)