Multimodal Patch Embeddings

Implementation available here: https://github.com/TinyVolt/multimodal-patch-embeddings

The output image embedding of the CLIP ViT is multimodal: it lives in the same space as the text embeddings and can be compared with them directly. The patch embeddings, however, are not multimodal. This project introduces a small change in the distillation of the CLIP ViT that makes the patch embeddings multimodal as well, so a text vector can be compared not just with the image vector but also with the individual patch vectors.
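For example, once the patch embeddings share the text embedding space, a text query can be scored against every patch with plain cosine similarity. A minimal sketch (the shapes and variable names are illustrative, not the repo's API):

```python
import torch
import torch.nn.functional as F

# Assumed shapes: 196 patches (14x14 grid), 512-dim embeddings.
patch_embeds = torch.randn(196, 512)  # multimodal patch embeddings from the distilled ViT
text_embed = torch.randn(512)         # CLIP text embedding of a query, e.g. "dog"

# Cosine similarity between the text query and every patch.
patch_embeds = F.normalize(patch_embeds, dim=-1)
text_embed = F.normalize(text_embed, dim=-1)
scores = patch_embeds @ text_embed    # (196,)

# Reshape to the patch grid to see which regions activate for the query.
heatmap = scores.reshape(14, 14)
```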

How to get multimodal patch embeddings

The main idea is to think of the final image embedding as a convex combination of points (the patch embeddings) on a hypersphere. To do so, the patch embeddings are normalized to the same norm in the final layer. But this change alone is not enough to make the patch embeddings multimodal: I tried it and it did not work.
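In code, the idea amounts to normalizing the final-layer patch embeddings to unit norm and pooling them with weights that sum to one. A rough sketch, assuming the weights come from some learned pooling (the softmax weighting shown here is a placeholder, not necessarily the repo's exact scheme):

```python
import torch
import torch.nn.functional as F

def image_embedding_from_patches(patch_embeds: torch.Tensor,
                                 pooling_logits: torch.Tensor) -> torch.Tensor:
    """Form the image embedding as a convex combination of patch embeddings
    that all lie on the unit hypersphere.

    patch_embeds:   (num_patches, dim) final-layer patch embeddings
    pooling_logits: (num_patches,) unnormalized scores for the weighting
    """
    # Put every patch embedding on the hypersphere (same norm for all).
    patch_embeds = F.normalize(patch_embeds, dim=-1)
    # Convex weights: non-negative and summing to 1.
    weights = torch.softmax(pooling_logits, dim=-1)
    # Convex combination of points on the hypersphere.
    return weights @ patch_embeds  # (dim,)
```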

What worked was adding an attention mask to the transformer layers. Once each token was limited to attending only to its neighbors, the patch embeddings became multimodal, because each patch embedding now carried information about the local structure of the image around it.
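One way to build such a mask is to let each patch attend only to patches within a small window around it on the 2D patch grid. A sketch, assuming a square grid and a window radius (both are illustrative parameters; handling of any CLS/global token is omitted):

```python
import torch

def local_attention_mask(grid_size: int, radius: int = 1) -> torch.Tensor:
    """Boolean mask of shape (N, N) with N = grid_size**2: entry (i, j) is
    True if patch j lies within `radius` rows and columns of patch i.
    True = attention allowed, False = masked out."""
    coords = torch.stack(torch.meshgrid(
        torch.arange(grid_size), torch.arange(grid_size), indexing="ij"
    ), dim=-1).reshape(-1, 2)                                # (N, 2) row/col per patch
    diff = (coords[:, None, :] - coords[None, :, :]).abs()   # (N, N, 2)
    return (diff <= radius).all(dim=-1)                      # Chebyshev neighborhood

# Example: 14x14 patch grid, each patch attends to its 3x3 neighborhood.
mask = local_attention_mask(14, radius=1)
# Inside each attention layer, masked positions get -inf before the softmax,
# e.g. attn_scores.masked_fill(~mask, float("-inf")).
```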

Results

Patch activations for the terms: dog, shadow, legs, waves, sand

Patch activations for the terms: promotion, cost, oranges, number, fruits, bananas

Patch activations for the terms: sky, sun, sand, water, clouds, surfing

Because each patch can only attend to its neighbors, the overall performance of the model takes a hit. In exchange for this compromise, we get patch embeddings that are multimodal. This opens up possibilities such as open set detection without any cross attention between the modalities, better interpretability of the model, and more.
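As a usage sketch of the open-set idea: arbitrary text queries can be scored against every patch independently, with no cross attention between the modalities (the query list below just reuses terms from the figures above; the embeddings are stand-ins for the model's outputs):

```python
import torch
import torch.nn.functional as F

# Stand-ins for the model's outputs: multimodal patch embeddings for one image
# and CLIP text embeddings for free-form, user-provided queries.
patch_embeds = F.normalize(torch.randn(196, 512), dim=-1)          # (num_patches, dim)
queries = ["dog", "shadow", "legs", "waves", "sand"]
text_embeds = F.normalize(torch.randn(len(queries), 512), dim=-1)  # (num_queries, dim)

# Per-patch, per-query similarity: one matrix multiply, no cross attention.
scores = patch_embeds @ text_embeds.T              # (num_patches, num_queries)
best_query = scores.argmax(dim=-1)                 # best-matching query for each patch
heatmaps = scores.T.reshape(len(queries), 14, 14)  # one activation map per query
```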