Multimodal Patch Embeddings
Implementation available here: https://github.com/TinyVolt/multimodal-patch-embeddings
The output image embedding of the CLIP ViT is multimodal: it lives in the same space as the text embeddings and can be compared with them directly. The patch embeddings, however, are not multimodal. This project introduces a small change to the distillation of the CLIP ViT that makes the patch embeddings multimodal, so a text vector can be compared not just with the image vector but also with the individual patch vectors.
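Once the patch embeddings share a space with the text embeddings, scoring a query term against every patch is a single matrix product. The sketch below assumes you already have a `(num_patches, dim)` tensor of patch embeddings and a `(dim,)` CLIP text embedding; the function and variable names are illustrative, not part of the repo's API.

```python
import torch
import torch.nn.functional as F

def patch_activations(patch_embeds: torch.Tensor, text_embed: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between a text embedding and every patch embedding.

    patch_embeds: (num_patches, dim) multimodal patch embeddings
    text_embed:   (dim,) CLIP text embedding for a query term
    Returns a (num_patches,) tensor that can be reshaped into a grid heatmap.
    """
    patch_embeds = F.normalize(patch_embeds, dim=-1)
    text_embed = F.normalize(text_embed, dim=-1)
    return patch_embeds @ text_embed
```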
How to get multimodal patch embeddings
The main idea is to treat the final image embedding as a convex combination of points on a hypersphere, where the points are the patch embeddings. To do so, the patch embeddings in the final layer are normalized to a common norm. On its own, though, this change is not enough to make the patch embeddings multimodal. I tried this idea out and it did not work. A sketch of the idea is shown below.
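The following is a minimal sketch of the "convex combination on a hypersphere" idea, assuming a `(num_patches, dim)` tensor of final-layer patch embeddings and one unnormalized score per patch. The softmax weighting shown here is an assumed choice for illustration, not necessarily the weighting used in the repo.

```python
import torch
import torch.nn.functional as F

def image_embedding_from_patches(patch_embeds: torch.Tensor, patch_logits: torch.Tensor) -> torch.Tensor:
    """Build the image embedding as a convex combination of unit-norm patch embeddings.

    patch_embeds: (num_patches, dim) final-layer patch embeddings
    patch_logits: (num_patches,) unnormalized per-patch scores (assumed, e.g. from a linear head)
    """
    # Project every patch embedding onto the unit hypersphere (same norm for all patches).
    points = F.normalize(patch_embeds, dim=-1)
    # Softmax weights are non-negative and sum to 1, so the result is a convex combination.
    weights = patch_logits.softmax(dim=-1)
    return (weights.unsqueeze(-1) * points).sum(dim=0)  # (dim,)
```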
What worked was adding an attention mask to the transformer layers: once each token could attend only to its neighbors, the patch embeddings became multimodal, because each patch embedding now carried information about the local structure of the image.
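Below is a minimal sketch of such a neighborhood mask, assuming a square `grid_size x grid_size` patch grid and a window radius; the exact mask shape and radius used in the project may differ, and any class token would need separate handling.

```python
import torch

def local_attention_mask(grid_size: int, radius: int = 1) -> torch.Tensor:
    """Boolean (N, N) mask, N = grid_size**2, True where two patches may attend to each other.

    Two patches are neighbors when their Chebyshev distance on the patch grid is <= radius.
    """
    coords = torch.stack(
        torch.meshgrid(torch.arange(grid_size), torch.arange(grid_size), indexing="ij"),
        dim=-1,
    ).reshape(-1, 2)                                        # (N, 2) row/col of each patch
    diff = (coords[:, None, :] - coords[None, :, :]).abs()  # (N, N, 2) pairwise offsets
    return diff.max(dim=-1).values <= radius                # (N, N) bool

# Usage: convert to an additive bias with -inf where attention is disallowed,
# suitable as an attn_mask for nn.MultiheadAttention or scaled_dot_product_attention.
mask = local_attention_mask(grid_size=14, radius=1)
attn_bias = torch.zeros(mask.shape).masked_fill(~mask, float("-inf"))
```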
Results
Patch activations for the terms: dog, shadow, legs, waves, sand
Patch activations for the terms: promotion, cost, oranges, number, fruits, bananas
Patch activations for the terms: sky, sun, sand, water, clouds, surfing
Because each patch can attend only to its neighbors, the model's performance takes a hit. In exchange for this compromise, the patch embeddings become multimodal. This opens up possibilities such as open-set detection without cross-attention between the modalities, better interpretability of the model, and more.