This is a linkpost for https://distill.pub/2021/multimodal-neurons/
Their faces and poses images are solid gold. They would make great Gravatars, generated from random points in CLIP's face latent space.
They also document that this multimodality allows for "typographic attacks", where labeling an item with a particular text causes the network to misclassify the item as an instance of that text.
"These are not the droids you are looking for."
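The mechanism behind such an attack can be sketched abstractly. CLIP classifies zero-shot by comparing an image embedding against candidate text embeddings via cosine similarity, so a piece of text pasted into the image only has to drag the image embedding toward the written label's direction. A toy numpy sketch of this geometry — all embeddings below are made up, and no actual CLIP model is involved:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

rng = np.random.default_rng(0)

# Hypothetical text embeddings for two candidate labels.
text_apple = unit(rng.normal(size=64))
text_ipod = unit(rng.normal(size=64))

# A clean photo of an apple embeds mostly along the "apple" direction.
img_clean = 0.9 * text_apple + 0.1 * unit(rng.normal(size=64))

# Taping a paper reading "iPod" onto the apple drags the image
# embedding toward the written word's direction: the typographic attack.
img_attacked = 0.4 * text_apple + 0.8 * text_ipod

def classify(img):
    """Zero-shot classification: pick the label whose text embedding is closest."""
    sims = {"apple": cosine(img, text_apple), "iPod": cosine(img, text_ipod)}
    return max(sims, key=sims.get)

print(classify(img_clean))     # "apple"
print(classify(img_attacked))  # "iPod"
```

The point of the sketch is that nothing in the classifier distinguishes "looks like an iPod" from "has the word iPod written on it" — both just move the image embedding in the same direction.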
A paper investigating how individual neurons in a CLIP model (an image/text neural net combining a ResNet vision model with a Transformer language model) respond to various abstract concepts. This shouldn't be very surprising after GPT-3 and DALL-E, but still, identifying multimodal neurons feels uncomfortably close to "a neural net that understands abstract concepts", and thus to AGI.
Some individual neurons that they isolated are shown in the article.