A co-creator of DALL-E 2 wrote this blog post: http://adityaramesh.com/posts/dalle2/dalle2.html .
From that blog post:
"One might ask why this prior model is necessary: since the CLIP text encoder is trained to match the output of the image encoder, why not use the output of the text encoder as the “gist” of the image? The answer is that an infinite number of images could be consistent with a given caption, so the outputs of the two encoders will not perfectly coincide. Hence, a separate prior model is needed to “translate” the text embedding into an image embedding that could plausibly match it."
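To make the quoted argument concrete, here is a minimal toy sketch in NumPy. It is not DALL-E 2's code and it makes several assumptions: the embeddings are random vectors rather than CLIP outputs, the text embedding is approximated by the normalized mean of the matching image embeddings, and `toy_prior` is a hypothetical stand-in that adds Gaussian noise where the real prior would be a learned model. It only illustrates that one caption embedding cannot coincide with every matching image embedding, and that a prior picks one plausible image embedding per call.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(v):
    """Scale vectors to unit length along the last axis."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

dim = 8

# Hypothetical image embeddings: five different images that fit one caption.
center = normalize(rng.normal(size=dim))
image_embs = normalize(center + 0.5 * rng.normal(size=(5, dim)))

# Stand-in for the text embedding: the unit vector with the highest
# average cosine similarity to all five images (their normalized mean).
text_emb = normalize(image_embs.mean(axis=0))

# Cosine similarity between the caption and each image: all below 1,
# because a single vector cannot equal five different vectors.
cos_to_images = image_embs @ text_emb

def toy_prior(text_embedding, rng, spread=0.5):
    """Hypothetical stand-in for a prior: sample one plausible image
    embedding near the text embedding. The real prior is a learned model."""
    noise = spread * rng.normal(size=text_embedding.shape)
    return normalize(text_embedding + noise)

# Two calls with the same caption give two different image embeddings.
sample_a = toy_prior(text_emb, rng)
sample_b = toy_prior(text_emb, rng)

print("cosine(text, image_i):", np.round(cos_to_images, 3))
print("cosine(sample_a, sample_b):", round(float(sample_a @ sample_b), 3))
```

In this toy setup the text embedding sits "between" the image embeddings, so feeding it straight to an image decoder would mean conditioning on a point that matches no particular image. Sampling from a prior instead yields a specific image embedding each time, which is one way to read the "translate" step in the quote.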