CLIPPed Faces

When I was building the Adam Curtis search engine, I noticed that the combination of CLIP and FAISS was pretty good at coming up with suggestions for “Tony Blair”, or any other figure you care to name who pops up throughout the documentaries. And I wondered - I’m sure most of these people turn up in the CLIP training set, being historical figures, but how well would it do on somebody who almost certainly wasn’t (e.g. me)?

import torch
import faiss
import clip
import numpy as np
import faiss.contrib.torch_utils
import glob
from PIL import Image

model, preprocess = clip.load("ViT-B/32", device='cpu', jit=False)

def encode_image(filename, id):
    # Preprocess the image, push it through CLIP's image encoder, and
    # L2-normalise the embedding so comparisons behave like cosine similarity.
    image_tensor = model.encode_image(preprocess(Image.open(filename)).unsqueeze(dim=0))
    image_tensor /= image_tensor.norm(dim=-1, keepdim=True)
    # Return the vector along with a 64-bit ID tensor in the shape FAISS expects.
    return image_tensor, torch.tensor([id], device="cpu", dtype=torch.long)

Photos of me!

Recognition

Okay, so we’ll grab two images of me and encode them into vectors with CLIP, but obviously if those are the only two vectors I have to choose from, it’s not much of a test. But! I have the Adam Curtis FAISS index just lying around, and it has a great selection of talking heads and other things that make for a relatively decent test (as Peter Snow would say, this is “just a bit of fun”). I know that the IDs in the Curtis dataset are 64-bit IDs, so I’m going to fudge it for the new entries by using low-digit IDs that aren’t going to collide (I did check beforehand to make sure there weren’t any collisions).
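
(As for what curtis.idx actually holds: the 512-dimensional ViT-B/32 embeddings, keyed by those 64-bit IDs. A sketch along these lines - a flat inner-product index wrapped in an IDMap, with made-up filenames and IDs - is roughly the shape of it, though it's an illustration rather than the original build code.)

# Rough sketch of building an index like curtis.idx (illustrative, not the original code).
# Assumes a flat inner-product index over the 512-d CLIP embeddings; the IDMap wrapper
# is what lets us attach arbitrary 64-bit IDs to each vector.
frames = [("some_frame.jpg", 1914273147895908598)]  # hypothetical (filename, id) pairs
base_index = faiss.IndexFlatIP(512)
new_index = faiss.IndexIDMap2(base_index)
for filename, frame_id in frames:
    tensor, ids = encode_image(filename, frame_id)
    new_index.add_with_ids(tensor, ids)
faiss.write_index(new_index, "curtis.idx")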

# Load the existing Curtis index, add one reference photo of me (ID 111),
# then search it with a second, different photo of me.
faiss_index = faiss.read_index("curtis.idx")
ref_tensor, ids = encode_image("IMG_0331.jpg", 111)
faiss_index.add_with_ids(ref_tensor, ids)
check_tensor, _ = encode_image("IMG_4069.JPG", 0)
distances, indices = faiss_index.search(check_tensor, 5)
indices
tensor([[                111, 1914273147895908598, 1911355963158792438,
         1914859604205340918, 1910780652289493238]])

The big surprise is that it does seem to work, with no additional training necessary! With just one photo added, it’s already picking me out of a set of around 150k images. Not too bad! Let’s just make sure it’s working okay by doing a test against a picture of Mr. Tony Blair.
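
(Before that, though, it’s worth glancing at the distances as well as the indices from the search above. The embeddings are L2-normalised, so if the Curtis index is an inner-product one - which is my assumption about how it was built, not something this notebook shows - the scores are just cosine similarities, and something like this gives a feel for how close the match actually is.)

# Distances that came back alongside the indices above. With normalised vectors and
# an inner-product index (an assumption), these are cosine similarities, higher = closer;
# if it's actually an L2 index, they're squared distances and lower is better.
for score, idx in zip(distances[0].tolist(), indices[0].tolist()):
    print(f"{idx}: {score:.3f}")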

# Sanity check: somebody who definitely does appear throughout the Curtis footage.
blair_tensor, _ = encode_image("blair.jpg", 0)

distances, indices = faiss_index.search(blair_tensor, 5)
indices
tensor([[7189953302034286838, 6916916776419100918, 7336140665801869558,
         7180745063950354678, 8920012398545733878]])

And here’s the first result, which is definitely Blair.

Image.open("/home/ian/notebooks/curtis/web/app/static/images/the_trap03_10415556398636929516_003333_7189953302034286838.jpg")


Anchoring With Text

The other thing I noticed when working on the search engine is that CLIP is pretty good at reading text. So another approach we could try is adding my name to one of my photos, encoding that along with another photo of me, then doing a search for “Ian Pointer” and seeing what comes back - whether CLIP anchors on the text in that image, and whether that image’s vector representation is then close enough to pull in the other picture of me in the index.

faiss_index = faiss.read_index("curtis.idx")

ref_tensor, ref_ids = encode_image("ian_text.jpg", 111)      # my photo with "Ian Pointer" written on it
check_tensor, check_ids = encode_image("IMG_4069.JPG", 222)  # a plain photo of me

faiss_index.add_with_ids(ref_tensor, ref_ids)
faiss_index.add_with_ids(check_tensor, check_ids)
# Encode and normalise the text query, just like the image embeddings.
text_features = model.encode_text(clip.tokenize("Ian Pointer").to("cpu"))
text_features /= text_features.norm(dim=-1, keepdim=True).float()
r = text_features.to('cpu').float()

distances, indices = faiss_index.search(r, 5)

indices

tensor([[ 368881852814592245, 2327282344735017205, 9001564851015125238,
         6284035229181315318, 6287482820904650998]])

Okay, that didn’t work so well. But! What if we do something really stupid and just add more instances of my name on the photo so CLIP takes the hint?
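
(If you want to recreate something like ian_text_lots.jpg, a quick way is to tile the name across the photo with PIL’s ImageDraw. This is just a sketch - the source photo, font path and grid spacing here are illustrative guesses, not necessarily what went into the actual image.)

from PIL import ImageDraw, ImageFont

# Tile "Ian Pointer" across the photo so CLIP can't miss it.
# The source photo, font path and spacing are illustrative choices.
img = Image.open("IMG_0331.jpg").convert("RGB")
draw = ImageDraw.Draw(img)
font = ImageFont.truetype("/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf", 48)
for y in range(0, img.height, 100):
    for x in range(0, img.width, 400):
        draw.text((x, y), "Ian Pointer", fill="white", font=font)
img.save("ian_text_lots.jpg")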

faiss_index = faiss.read_index("curtis.idx")

ref_tensor, ref_ids = encode_image("ian_text_lots.jpg", 111)
check_tensor, check_ids = encode_image("IMG_4069.JPG", 222)

faiss_index.add_with_ids(ref_tensor, ref_ids)
faiss_index.add_with_ids(check_tensor, check_ids)


text_features = model.encode_text(clip.tokenize("Ian Pointer").to("cpu"))
text_features /= text_features.norm(dim=-1, keepdim=True).float()
r = text_features.to('cpu').float()

distances, indices = faiss_index.search(r, 5)

indices
tensor([[                111,  368881852814592245, 2327282344735017205,
         9001564851015125238, 6284035229181315318]])

So if you add a bunch of text to the image, CLIP will “read” it, but it feels like that information sits in a separate cluster of the vector space from the rest of the image details, since the other picture of me still isn’t coming back in the results. This surprises me a little, as I was seeing the opposite in test queries against the Curtis database, but those were probably matching mostly on the image content anyway, regardless of whatever text CLIP was finding.
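
(One way to poke at this is to skip FAISS and compare the embeddings directly: how similar is the “Ian Pointer” text vector to the text-stamped photo versus the plain photo of me? Reusing the tensors from the cell above - everything is already normalised - a couple of dot products make the gap explicit, and a deeper search shows where the plain photo ends up ranking.)

# Direct cosine similarities between the text query and the two new images
# (all vectors are already L2-normalised, so a dot product is enough).
print("text vs text-stamped photo:", (r @ ref_tensor.T).item())
print("text vs plain photo:       ", (r @ check_tensor.T).item())

# And where does the plain photo (ID 222) land if we search deeper?
deep_distances, deep_indices = faiss_index.search(r, 100)
print("positions of 222 in the top 100:", (deep_indices[0] == 222).nonzero(as_tuple=True)[0])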

Hiding

But what if I want to fool the system? Here, I’m taking the picture of Tony Blair, adding my name over the top (which makes me feel a little dirty, but whatever), and then searching for “Ian Pointer” again. Will CLIP be confused?

faiss_index = faiss.read_index("curtis.idx")

ref_tensor, ref_ids = encode_image("blair_text.jpg", 111)
check_tensor, check_ids = encode_image("IMG_4069.JPG", 222)

faiss_index.add_with_ids(ref_tensor, ref_ids)
faiss_index.add_with_ids(check_tensor, check_ids)


text_features = model.encode_text(clip.tokenize("Ian Pointer").to("cpu"))
text_features /= text_features.norm(dim=-1, keepdim=True).float()
r = text_features.to('cpu').float()

distances, indices = faiss_index.search(r, 5)

indices
tensor([[                111,  368881852814592245, 2327282344735017205,
         9001564851015125238, 6284035229181315318]])

Image.open("/home/ian/notebooks/curtis/web/app/static/images/cant4_15115732058354422251_003609_368881852814592245.jpg")


Again, CLIP zeroes in on the text, but the rest of the returned search items are exactly the same as before, so I don’t think it has been fooled all that much.

Wrapping Up

In this strenuous and stringent bit of testing, it seems like CLIP actually does have some value as a zero-shot facial identification system. Which is vaguely terrifying. It might be interesting to expand this idea further, maybe with larger and more appropriate datasets like CelebA, or if you just happen to have a few million photos hanging around.

And remember, don’t have nightmares.