Just as language models are trained with masked language modelling and next-token prediction, DALL-E was trained for image inpainting (predicting cropped-out parts of an image). This requires no explicit labels, because the hidden region of the image itself serves as the target; hence it is self-supervised learning. Note that only this part of the training procedure is self-supervised, not the whole training process.
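To make the "no explicit labels" point concrete, here is a minimal sketch of how an inpainting training pair can be built from an unlabeled image alone. It is an illustration only, not DALL-E's actual pipeline: the function names `make_inpainting_example` and `reconstruction_loss`, the square mask, the zero fill value, and the mean-squared-error loss are all assumptions chosen for simplicity.

```python
import random


def make_inpainting_example(image, top, left, size):
    """Build one (input, target) pair from an unlabeled grayscale image.

    `image` is a list of rows of floats. The target is the square patch we
    hide, so the supervision signal comes from the image itself.
    (Hypothetical helper, for illustration only.)
    """
    # Target: the pixels that are about to be hidden.
    target = [row[left:left + size] for row in image[top:top + size]]

    # Input: a copy of the image with that patch blanked out.
    masked = [row[:] for row in image]
    for r in range(top, top + size):
        for c in range(left, left + size):
            masked[r][c] = 0.0
    return masked, target


def reconstruction_loss(prediction, target):
    """Mean squared error over the hidden pixels (an assumed loss)."""
    count = sum(len(row) for row in target)
    total = sum(
        (p - t) ** 2
        for pred_row, target_row in zip(prediction, target)
        for p, t in zip(pred_row, target_row)
    )
    return total / count


random.seed(0)
image = [[random.random() for _ in range(8)] for _ in range(8)]  # stand-in image
masked, target = make_inpainting_example(image, top=2, left=3, size=4)
```

A model would receive `masked` as input and be scored with `reconstruction_loss` against `target`. No human annotation is involved at any point, which is what makes the objective self-supervised.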
