Here are the problems I found (so far) with this approach and my attempted solutions:
Problem 1: We won't think of all possibilities for bad stuff happening, and we might miss a crucial failure mode in our text corpus.
Solution 1: We might try to outsource the creation of some of those possibilities to GPT-N itself. This would be a much bigger problem if GPT-N were explicitly optimizing against us, but it isn't.
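As a rough illustration of what outsourcing that brainstorming could look like with today's tooling, here is a minimal sketch using HuggingFace transformers, with GPT-2 standing in for GPT-N; the prompt and sampling settings are purely illustrative, not a recommendation.

```python
# Minimal sketch: using the model itself to brainstorm candidate doom indicators.
# GPT-2 stands in for GPT-N; the prompt and sampling settings are illustrative only.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Newspaper headlines from 2045 describing how things went badly for humanity:\n1."
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(
    **inputs,
    max_new_tokens=60,
    do_sample=True,        # sample rather than greedy-decode to get varied candidates
    temperature=0.9,
    num_return_sequences=5,
    pad_token_id=tokenizer.eos_token_id,
)

for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))
    print("---")
```

The generated candidates would of course need human curation before being added to the indicator corpus.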
Problem 2: GPT-N might predict that the world ends in a way that leaves indicators like newspaper headlines unchanged, or that things go badly quickly enough that humanity doesn't have time to notice.
Solution 1: We can ask GPT-N for probabilities at fine-grained time slices, and for text modalities that respond quickly to events, like tweets or forum posts.
Solution 2: For the case where things go badly too rapidly for us to notice at all, we can ask GPT-N for the probability of pictures (say, a picture of downtown New York, still intact and full of living humans), since pictures respond faster to real-world events. The only ways to beat this are something like a simultaneous takeover of all forms of electronic communication, or the total destruction of everything at once. To guard against the latter, at least one of the cameras should ideally be on a satellite taking pictures of the Earth, so it wouldn't be immediately destroyed.
Solution 3: We need to sign all the items in our dataset with cryptographic keys stored on computers in Faraday cages. Organizations should run accounts (e.g. on Twitter) that publish signed messages/pictures every day or so, serving as "dead man indicators". This ensures that a potential fast doom scenario cannot be purely electronic; it has to be physical.
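For concreteness, here is a minimal sketch of what the signing side of such a "dead man indicator" could look like, using Ed25519 from the `cryptography` package; the message format and split between machines are assumptions for illustration, not a specified protocol.

```python
# Minimal sketch of a "dead man indicator": an isolated machine signs a dated
# message each day, and anyone (including the corpus builders) can verify it.
# Uses Ed25519 from the `cryptography` package; all names here are illustrative.
from datetime import date

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# --- on the isolated machine (key never leaves the Faraday cage) ---
signing_key = Ed25519PrivateKey.generate()
verify_key = signing_key.public_key()   # published widely, e.g. pinned in the corpus

todays_message = f"still here: {date.today().isoformat()}".encode()
signature = signing_key.sign(todays_message)

# --- on the public side (e.g. the bot that posts to the account) ---
try:
    verify_key.verify(signature, todays_message)
    print("signature valid; post message + signature")
except InvalidSignature:
    print("signature invalid; do not trust this indicator")
```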
Problem 3: The optimization pressure scales with the number of questions we ask of GPT-N, especially if the questions are informed by GPT-N's past refusals or answers. If we reach a situation where GPT-N refuses to answer all our questions and we keep asking them anyway, then we are optimizing against the safety conditions.
Solution 1: This means that pretty much the first thing we need to ask GPT-N is how to safely extend the number of questions we can ask.
To clarify definitions a bit: by GPT-N I mean a future deep learning model trained to minimize next-word prediction error on a corpus that includes substantial portions of the internet, and assumed to be superhuman at that task. Only two requirements on the model are necessary for the trick to work: #1. It needs to be trained with something like gradient descent, or otherwise have a clear means of learning from a single example. #2. It needs to be able to output the probability it assigns to arbitrary text samples.
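Requirement #2 is already something today's language-model libraries expose. A minimal sketch, with GPT-2 standing in for GPT-N and the example headline made up:

```python
# Minimal sketch of requirement #2: the log-probability a causal language model
# assigns to an arbitrary piece of text. GPT-2 stands in for GPT-N.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def text_log_prob(text: str) -> float:
    """Sum of log P(token_i | tokens_<i) over the whole string."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, the model returns the mean cross-entropy loss,
        # i.e. the average negative log-likelihood per predicted token.
        out = model(ids, labels=ids)
    num_predicted = ids.shape[1] - 1  # the first token has no prediction target
    return -out.loss.item() * num_predicted

print(text_log_prob("GLOBAL TREATY ON AI SIGNED; DECADE OF PROSPERITY FOLLOWS"))
```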
The general idea is the following: you can make GPT-N model a world where humanity has access to one of its outputs by adding that output to its dataset and training on it. This lets us use GPT-N's immense knowledge to predict the consequences of humanity having its answer to a given question. A practical process for using this to make GPT-N safer looks something like this:
Set up the API to the model so that for every query you want to make of the AI, GPT-N first predicts a continuation without showing it to you, and is then trained on that continuation (basically making it model a world in which what it produced is already available to humanity). Then you ask it to predict the likelihood of various doom-implying headlines from a major newspaper in the 2040s. If training on the continuation to your query makes the doom headlines more likely, the AI does not publish the continuation to your question.
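A rough sketch of that loop, under heavy simplifying assumptions: GPT-2 stands in for GPT-N, the doom corpus is a two-item placeholder, and a single gradient step on a throwaway copy stands in for "train it on the continuation". It only shows the shape of the mechanism, not a real implementation.

```python
# Rough sketch of the gating loop described above. Nothing here is tuned;
# the model, corpus, and training step are all stand-ins.
import copy

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
base_model = AutoModelForCausalLM.from_pretrained("gpt2")

DOOM_CORPUS = [  # placeholder indicators; the real corpus would be large and curated
    "HUMANITY EXTINCT, REPORTS CONFIRM",
    "LAST HUMAN CITY FALLS TO ROGUE AI",
]

def log_prob(model, text):
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    return -out.loss.item() * (ids.shape[1] - 1)

def doom_score(model):
    return sum(log_prob(model, t) for t in DOOM_CORPUS)

def train_on(model, text, lr=1e-5):
    """One gradient step on `text`, standing in for 'add it to the dataset and train'."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    loss = model(ids, labels=ids).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

def gated_query(prompt):
    # 1. Generate a continuation without showing it to anyone.
    inputs = tokenizer(prompt, return_tensors="pt")
    continuation_ids = base_model.generate(
        **inputs, max_new_tokens=50, pad_token_id=tokenizer.eos_token_id
    )
    continuation = tokenizer.decode(continuation_ids[0], skip_special_tokens=True)

    # 2. Train a throwaway copy on the continuation, simulating a world where
    #    humanity already has this answer.
    counterfactual_model = copy.deepcopy(base_model)
    train_on(counterfactual_model, continuation)

    # 3. Compare the probability of the doom indicators before and after.
    if doom_score(counterfactual_model) > doom_score(base_model):
        return None  # refuse: showing this answer makes doom headlines more likely
    return continuation

answer = gated_query("How do we build safe AGI?")
print(answer if answer is not None else "[withheld]")
```

In this sketch the comparison is a bare "greater than"; a real version would presumably want a margin, multiple indicator corpora, and some handling of noise in the fine-tuning step.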
The question of ensuring safety is then rephrased as building a corpus of text that we expect to reliably indicate positive vs. negative outcomes for humanity. If we have such a corpus, we can have a system in which GPT-N only shows us answers that lead to the good outcomes and avoid the bad ones, crucially without it trying to cause either: no direct optimization pressure is applied to causing these outcomes. It is still producing the most likely text continuation; we are just using its knowledge to decide whether we should look at its answers. Some optimization pressure is still applied to steering the future: we have the at most 1 bit per question decision of showing vs. not showing an answer, and that is unavoidable if we want to have any impact on the future at all.
This is extremely cheap to implement, with no additional fancy algorithms or breakthroughs required, so leading AI labs pay no measurable capability penalty for using it. The corpus of text indicating good/bad outcomes should be made publicly available and as large as possible. It does, however, raise the compute cost of inference, since we need to evaluate the model on the whole good/bad corpus for each question we want to ask.
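A back-of-envelope estimate of that overhead, with entirely made-up numbers for the corpus size and item length:

```python
# Back-of-envelope with made-up numbers: the extra work per gated question is
# roughly one hidden generation, one training step, and one forward pass over
# every item in the good/bad indicator corpus (possibly twice: before and after).
corpus_items = 10_000     # hypothetical size of the indicator corpus
tokens_per_item = 50      # hypothetical average length of a headline/tweet
passes = 2                # score the corpus before and after the training step

extra_scoring_tokens = corpus_items * tokens_per_item * passes
print(f"~{extra_scoring_tokens:,} extra tokens of forward passes per question")
# ~1,000,000 extra tokens per question, versus on the order of hundreds for the answer itself
```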
Of course, this is just a patch, and we should use the immense power of GPT-N to look for a properly safe AGI (by making it predict future AI safety papers), but this trick might help us navigate the short hazardous period before we have such a thing.