As someone who has been advocating a lot recently for more emphasis on safety evaluations and testing generally, I here again feel compelled to say that you really ought to include a 'test' phase in-between training and deployment.
Testing
Parameters are fixed.
Failures are fine.
Model is sandboxed.
Fixed and/or dynamic testing distribution, chosen specifically to be different in important ways from the training distribution.
Furthermore, I think that having sharp distinctions between each phase, and well-sandboxed train and test environments, is of critical importance going forward. I think getting this to actually be the case is going to take some work from AI governance and safety-advocate folks. I absolutely don't think we should quietly surrender this point.
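To make the proposal concrete, here is a minimal sketch of what such a test phase could look like operationally. Everything below (the `test_distribution` and `sandbox` interfaces, the `outcome.acceptable` flag) is hypothetical rather than any existing API; the point is only to show the four listed properties in code.

```python
import copy

def run_test_phase(model, test_distribution, sandbox):
    """Hypothetical test phase between training and deployment.

    - Parameters are fixed: we work on a frozen copy and never update it.
    - Failures are fine: bad outcomes are only logged, nothing else happens.
    - Model is sandboxed: all actions go through `sandbox`, never the real world.
    - The test distribution is chosen to differ from the training distribution.
    `test_distribution` and `sandbox` are assumed interfaces, not real libraries.
    """
    frozen_model = copy.deepcopy(model)  # snapshot the weights; no gradient updates here
    failures = []
    for scenario in test_distribution.sample_scenarios():
        outcome = sandbox.run(frozen_model, scenario)  # side effects stay inside the sandbox
        if not outcome.acceptable:
            failures.append((scenario, outcome))  # failures are cheap: record and move on
    return failures  # informs the go/no-go deployment decision
```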
I basically agree, ensuring that failures are fine during training would sure be great. (And I also agree that if we have a setting where failure is fine, we want to use that for a bunch of evaluation/red-teaming/...). As two caveats, there are definitely limits to how powerful an AI system you can sandbox IMO, and I'm not sure how feasible sandboxing even weak-ish models is from the governance side (WebGPT-style training just seems really useful).
Yes, I agree that actually getting companies to consistently use the sandboxing is the hardest piece of the puzzle. I think there are some promising advancements making it easier to get the benefits of not-sandboxing while still being in a sandbox. For instance: using a snapshot of the entire web, hosted on an isolated network: https://ai.facebook.com/blog/introducing-sphere-meta-ais-web-scale-corpus-for-better-knowledge-intensive-nlp/
I think this takes away a lot of the disadvantages, especially if you combine it with some amount of simulated backends to provide interactivity (not an intractably hard task, but it would take some funding and development time).
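As a rough illustration of the idea: static reads get served from a frozen local copy of the web (in the spirit of Sphere), and anything interactive is routed to simulated backends instead of live services. The names and interfaces below are made up for illustration; Sphere itself is a corpus, not an API like this.

```python
# Hypothetical sketch: "web access" for a WebGPT-style agent that never leaves
# the sandbox. Static reads come from a frozen snapshot; interactive requests
# hit simulated backends instead of live services.

from typing import Callable, Optional

SNAPSHOT: dict[str, str] = {}  # url -> page text, e.g. loaded from a web-scale corpus
MOCK_BACKENDS: dict[str, Callable[[dict], str]] = {}  # url prefix -> simulated service

def sandboxed_fetch(url: str, payload: Optional[dict] = None) -> str:
    if payload is not None:  # interactive request (form submission, API call, ...)
        for prefix, backend in MOCK_BACKENDS.items():
            if url.startswith(prefix):
                return backend(payload)  # simulated response, no real-world side effects
        return "410 Gone: no simulated backend for this endpoint"
    return SNAPSHOT.get(url, "404 Not Found in snapshot")  # read from the frozen web
```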
TL;DR: Training and deployment of ML models differ along several axes, and you can have situations that are like training in some ways but like deployment in others. I think this will become more common in the future, so it's worth distinguishing which properties of training/deployment any given argument relies on.
The training/deployment view on ML
We usually think of the lifecycle of an ML model as a two-phase process: first, you train the model, and then (once you're satisfied with its performance), you deploy it to do some useful task. These two phases differ in several ways:
Exceptions to the training/deployment archetypes
Here's a table summarizing typical properties of the training and deployment phases from the previous section:
Clearly, these don't always hold. Some examples:
All these exceptions can occur in various combinations and to various degrees. That implies two things:
Exceptions will likely become more common and extreme
I'd say that for now, the training/deployment view is usually a pretty good approximation that only requires minor caveats. But I expect that the types of exceptions from the previous section will tend to dominate more and more in the future. For example:
Takeaways
I am skeptical that the training/deployment view is a good one for many discussions related to AI safety. Most importantly, I don't think that all of the danger comes from the deployment phase, nor that distributional shift mostly happens between training and deployment. I'd guess that many (most?) people who talk about "training" and "deployment" would agree with that, but these claims could easily be read as implicit assumptions by others, especially newcomers. (As one datapoint, I did not realize some of the things in this post until pretty recently. And I'm still not sure how many people expect training and deployment to remain as valid a framework as it is now.)
With less certainty than the rest of this post: explicitly thinking about the various axes along which training and deployment differ also seems to suggest productive and clear frameworks. (For example, I used to think of distributional shift as a phenomenon that makes sense on its own, whereas I now usually think about the rate of distributional shift over time compared to the rate of updating the parameters, or the rate of oversight.)
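As a toy illustration of that reframing (my own numbers and framing, not from the post): what matters is roughly how much the distribution drifts between successive parameter updates, not "shift" as a one-off event at deployment time.

```python
def max_staleness(drift_per_step: float, retrain_every: int) -> float:
    """Worst-case gap between the current data distribution and the one the
    model was last trained on, assuming constant drift (hypothetical units)."""
    return drift_per_step * retrain_every

# The same per-step drift looks very different depending on the update rate:
print(max_staleness(0.01, 1))     # near-continual retraining: accumulated shift stays tiny
print(max_staleness(0.01, 1000))  # classic train-then-deploy: ~1000x more accumulated shift
```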
Thus my suggestion: when you notice yourself thinking or talking about "training vs deployment" of future AI models, try to figure out which specific properties of these phases you're interested in, and phrase your claims or arguments in terms of those.