How to Build Text Based Image Search

Using multi-modal text and image embedding models for text-based and image-based image search with Python and transformers.

Searching for images can be treated as an image classification task. There are two types of image classification:

Normal Image classification

These models are trained with a fixed list of categories and predict which of those categories the image falls into.

Lake

from transformers import pipeline
# Create the pipeline with the `google/vit-base-patch16-224` model
clf = pipeline("image-classification", "google/vit-base-patch16-224")
# Get the categories
results = clf("temp/lake-smaller.jpg")
print(results)
{'label': 'lakeside, lakeshore', 'score': 0.9137738943099976}
{'label': 'seashore, coast, seacoast, sea-coast', 'score': 0.017617039382457733}
{'label': 'promontory, headland, head, foreland', 'score': 0.016670119017362595}
{'label': 'valley, vale', 'score': 0.007515666540712118}
{'label': 'dam, dike, dyke', 'score': 0.004389212466776371}

You can then save these categories (at least the ones scoring above, say, 0.1) in a database and use them to search for images matching a search term.
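A minimal sketch of that idea using a plain SQLite table (the table layout, database name, and temp/ path are illustrative choices, not anything prescribed by the model):

import sqlite3
from transformers import pipeline

# The same classification pipeline as above
clf = pipeline("image-classification", "google/vit-base-patch16-224")

conn = sqlite3.connect("image_labels.db")
conn.execute("CREATE TABLE IF NOT EXISTS image_labels (filename TEXT, label TEXT, score REAL)")

def index_image(path):
    # Keep only the reasonably confident categories (score above 0.1)
    for result in clf(path):
        if result["score"] > 0.1:
            conn.execute(
                "INSERT INTO image_labels VALUES (?, ?, ?)",
                (path.split("/")[-1], result["label"], result["score"]),
            )
    conn.commit()

def search_by_label(term):
    # Simple substring match against the stored category names
    rows = conn.execute(
        "SELECT filename, label, score FROM image_labels WHERE label LIKE ?",
        (f"%{term}%",),
    )
    return rows.fetchall()

index_image("temp/lake-smaller.jpg")
print(search_by_label("lake"))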

However, the above approach will match "lake" but not "water" or "river".

The google/vit-base-patch16-224 model is small, at 86.6M parameters (346 MB), and predicts one of the 1,000 ImageNet classes.

There is also the larger google/vit-large-patch16-224 model (1.22 GB), which is more accurate but slower.

The name breaks down as follows (the values can be confirmed from the model config, as shown after the list):

vit: Vision Transformer (ViT).
base: the size of the model.
patch16: Images are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded.
224: All images are resized to a resolution of 224x224 pixels before being fed to the model.
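A quick way to confirm these values, assuming the standard field names on the ViT config in transformers:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("google/vit-base-patch16-224")
print(config.patch_size)     # 16  -> the "patch16" in the name
print(config.image_size)     # 224 -> the "224" in the name
print(len(config.id2label))  # 1000 ImageNet classes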

Zero-shot Image classification

To match arbitrary search terms and score how similar a search term is to an image, whatever the term is, you need a zero-shot image classification model.

These are embedding models with a shared embedding space: the text description of an image and the image itself are trained to end up at similar coordinates in that space.

This means that when you calculate the distance between the search term's embedding and the embeddings of several images, you get a measure of how similar each image is to the search term.

This also means "ice" will be similar to photos of "snow" even though the words are different.
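As a toy illustration of the comparison (the vectors below are made up and only 3-dimensional; real CLIP or SigLIP embeddings have hundreds of dimensions):

import numpy as np

# Made-up vectors purely for illustration of the distance idea
ice  = np.array([0.9, 0.1, 0.2])
snow = np.array([0.8, 0.2, 0.1])
car  = np.array([0.1, 0.9, 0.3])

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(ice, snow))  # ~0.99 -> the concepts are close in the space
print(cosine_similarity(ice, car))   # ~0.27 -> much further apart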

Two good models for this are:

CLIP large is a 0.4B-parameter (1.71 GB) model. The smaller openai/clip-vit-base-patch32 is only 605 MB and is the more popular version. Both are fairly small, quick models.
However, these models were released in 2021, and since then newer models have been trained which surpass them in benchmarks.

SigLIP 2 is a family of models released by Google in February 2025 which claims to outperform the original SigLIP models at all model scales in core capabilities.
The models come in various sizes: ViT-B (86M), L (303M), So400m (400M), and g (1B).
There are fixed-resolution versions, which resize and crop images to a fixed square resolution, and flexible (NaFlex) versions, which preserve the native aspect ratio and split images of different sizes and resolutions into fixed-size patches.

FG-CLIP 2 is a new model released in October which claims to further improve image classification results.

FG-CLIP 2 Benchmarks

It comes in base (1.54GB), large (3.59GB), and so400m (4.62GB) versions.

Simple Zero-shot Classification Example

Lake

If you just want to classify an image into several categories which you provide, you can use the Hugging Face transformers pipeline.

from transformers import pipeline
from transformers.image_utils import load_image

image_classifier = pipeline("zero-shot-image-classification", model="openai/clip-vit-large-patch14-336")
candidate_labels = [
    "2 cats",
    "a plane",
    "a remote",
    "a bear",
    "a ocean",
    "a water",
    "a river",
    "a lake",
    "a pond",
    "rain",
    "a mountain",
    "a car",
]
image = load_image("temp/lake-smaller.jpg")
results = image_classifier(image, candidate_labels)
print(results)
{'score': 0.9330312609672546, 'label': 'a lake'}
{'score': 0.02649594098329544, 'label': 'a water'}
{'score': 0.023041551932692528, 'label': 'a river'}
{'score': 0.01358820404857397, 'label': 'a pond'}
{'score': 0.0017832380253821611, 'label': 'a ocean'}
{'score': 0.0006893535610288382, 'label': 'a mountain'}
{'score': 0.0004389677778817713, 'label': 'a phone'}
{'score': 0.00038122545811347663, 'label': 'a remote'}
{'score': 0.0002814349136315286, 'label': 'a plane'}
{'score': 0.00019272200006525964, 'label': 'rain'}
{'score': 3.4059521567542106e-05, 'label': 'a car'}
{'score': 3.0467408578260802e-05, 'label': 'a bear'}
{'score': 1.143596091424115e-05, 'label': '2 cats'}

Using the google/siglip2-so400m-patch16-naflex model we get roughly the same ranking, though the scores are all much lower. This is expected: SigLIP models score each label independently with a sigmoid rather than a softmax over the candidates, so the scores do not sum to 1.

from transformers import pipeline
image_classifier = pipeline(model="google/siglip2-so400m-patch16-naflex", task="zero-shot-image-classification")
outputs = image_classifier(image, candidate_labels)
print(outputs)
{'score': 0.1465340554714203, 'label': 'a lake'}
{'score': 0.0038345942739397287, 'label': 'a water'}
{'score': 0.001178617007099092, 'label': 'a river'}
{'score': 0.00041009532287716866, 'label': 'a ocean'}
{'score': 0.00018598142196424305, 'label': 'a mountain'}
{'score': 8.42182053020224e-05, 'label': 'a pond'}
{'score': 3.1798244890524074e-05, 'label': 'rain'}
{'score': 1.0240605661238078e-05, 'label': 'a plane'}
{'score': 1.7888405636767857e-06, 'label': '2 cats'}
{'score': 1.6506612610101001e-06, 'label': 'a bear'}
{'score': 5.040944870415842e-07, 'label': 'a phone'}
{'score': 4.662872470362345e-07, 'label': 'a car'}
{'score': 3.0268274997524713e-08, 'label': 'a remote'}

Text to Image Search

However, if we want to build up a database of image embeddings and run text-based and image-based searches against it, we need to get the embeddings out of these models directly.

Example Images

Lake

cats

bmw-i7

beech-tree

Generate Image Embeddings
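The functions below rely on a few helpers (get_device, get_clip_model, cosine_similarity) and an IMAGE_PATHS list that are not shown. A minimal sketch of them, assuming the openai/clip-vit-large-patch14-336 checkpoint from the earlier example and image paths under temp/ (both assumptions for this sketch):

import numpy as np
import torch
from transformers import CLIPModel, CLIPProcessor

MODEL_NAME = "openai/clip-vit-large-patch14-336"  # assumed checkpoint

# Example images; the temp/ paths are assumptions for this sketch
IMAGE_PATHS = [
    "temp/lake-smaller.jpg",
    "temp/cats.jpg",
    "temp/bmw-i7.jpg",
    "temp/beech-tree.jpg",
]

_model, _processor = None, None

def get_device():
    """Pick the best available device."""
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"

def get_clip_model():
    """Load the CLIP model and processor once and cache them."""
    global _model, _processor
    if _model is None:
        _model = CLIPModel.from_pretrained(MODEL_NAME).to(get_device()).eval()
        _processor = CLIPProcessor.from_pretrained(MODEL_NAME)
    return _model, _processor

def cosine_similarity(a, b):
    """Cosine similarity; for L2-normalized vectors this is just the dot product."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))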

import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel
import numpy as np
import torch.nn.functional as F

def generate_image_embedding(image_path):
    """
    Generate an L2-normalized image embedding.

    Args:
        image_path: Path to the image file

    Returns:
        Tuple of (filename, embedding vector as numpy array)
    """
    device = get_device()
    model, processor = get_clip_model()

    # Load and convert image to RGB
    image = Image.open(image_path).convert("RGB")

    # Generate embedding
    with torch.no_grad():
        inputs = processor(images=image, return_tensors="pt").to(device)
        features = model.get_image_features(**inputs)
        # L2 normalize for proper cosine similarity
        features = F.normalize(features, p=2, dim=-1)
        embedding = features.cpu().numpy()[0]

    # Extract just the filename from the path
    filename = image_path.split("/")[-1]

    return filename, embedding
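A quick sanity check (the (768,) shape applies to the CLIP large checkpoint assumed above; other checkpoints use different embedding sizes):

filename, embedding = generate_image_embedding("temp/lake-smaller.jpg")
print(filename, embedding.shape)  # lake-smaller.jpg (768,)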

Generate Text Embeddings.

def generate_text_embedding(query_text):
    """
    Generate an L2-normalized text embedding.

    Args:
        query_text: Text query string

    Returns:
        Embedding vector as numpy array
    """
    device = get_device()
    model, processor = get_clip_model()

    with torch.no_grad():
        inputs = processor(text=[query_text], return_tensors="pt", padding=True).to(device)
        features = model.get_text_features(**inputs)
        # L2 normalize
        features = F.normalize(features, p=2, dim=-1)
        embedding = features.cpu().numpy()[0]

    return embedding
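The text embedding has the same dimensionality as the image embeddings, which is what makes the comparison possible (the query string here is just an illustration):

text_embedding = generate_text_embedding("a lake surrounded by mountains")
print(text_embedding.shape)  # (768,) -- same size as the image embeddings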

Load the image embeddings

def load_images():
    """
    Load all images and generate their embeddings.

    Returns:
        List of tuples: [(filename, embedding), ...]
    """
    print("Loading and embedding images...")
    embeddings = []

    for image_path in IMAGE_PATHS:
        print(f"  Processing {image_path.split('/')[-1]}...")
        filename, embedding = generate_image_embedding(image_path)
        embeddings.append((filename, embedding))

    print(f"Generated embeddings for {len(embeddings)} images\n")
    return embeddings

Compare the text embedding with each image embedding.

def search(query_text, image_embeddings):
    """
    Search for images matching the text query.

    Args:
        query_text: Text description to search for
        image_embeddings: List of (filename, embedding) tuples

    Returns:
        List of (filename, similarity_score) tuples sorted by similarity (best first)
    """
    print(f"Searching for: '{query_text}'")

    # Generate text embedding
    text_embedding = generate_text_embedding(query_text)

    # Compare with all image embeddings
    results = []
    for filename, image_embedding in image_embeddings:
        similarity = cosine_similarity(text_embedding, image_embedding)
        results.append((filename, similarity))

    # Sort by similarity (highest first)
    results.sort(key=lambda x: x[1], reverse=True)

    print("\nResults:")
    print("-" * 50)
    for filename, similarity in results:
        print(f"{filename:30s} | Similarity: {similarity:.4f}")
    print("-" * 50)
    print()

    return results

Example: run a few searches against the images.

image_embeddings = load_images()
search("a beautiful lake with mountains", image_embeddings)
search("luxury car", image_embeddings)
search("tree in nature", image_embeddings)
search("water and nature", image_embeddings)

Results

Searching for: 'a beautiful lake with mountains'

Results:
----------------------------------------------------
lake-smaller.jpg               | Similarity: 0.2865
beech-tree.jpg                 | Similarity: 0.1330
cats.jpg                       | Similarity: 0.1164
bmw-i7.jpg                     | Similarity: 0.1073
----------------------------------------------------
Searching for: 'luxury car'

Results:
----------------------------------------------------
bmw-i7.jpg                     | Similarity: 0.2513
beech-tree.jpg                 | Similarity: 0.1920
cats.jpg                       | Similarity: 0.1818
lake-smaller.jpg               | Similarity: 0.1685
----------------------------------------------------
Searching for: 'tree in nature'

Results:
----------------------------------------------------
beech-tree.jpg                 | Similarity: 0.2992
lake-smaller.jpg               | Similarity: 0.2080
cats.jpg                       | Similarity: 0.1543
bmw-i7.jpg                     | Similarity: 0.1377
----------------------------------------------------
Searching for: 'water and nature'

Results:
----------------------------------------------------
lake-smaller.jpg               | Similarity: 0.2672
beech-tree.jpg                 | Similarity: 0.2209
cats.jpg                       | Similarity: 0.1669
bmw-i7.jpg                     | Similarity: 0.1512
----------------------------------------------------

Search By Image

You can also search by image: generate an embedding for the query image and use it for the comparisons instead of a text embedding. The search_by_image function below is the same as search above, but takes an image path as the query.

def search_by_image(image_path, image_embeddings):
    """Search for images similar to the query image."""
    print(f"Searching for: '{image_path}'")

    # Generate an embedding for the query image
    _, search_image_embed = generate_image_embedding(image_path)

    # Compare with all image embeddings
    results = []
    for filename, image_embedding in image_embeddings:
        similarity = cosine_similarity(search_image_embed, image_embedding)
        results.append((filename, similarity))

    # Sort by similarity (highest first)
    results.sort(key=lambda x: x[1], reverse=True)

    print("Results:")
    print("-" * 50)
    for filename, similarity in results:
        print(f"{filename:30s} | Similarity: {similarity:.4f}")
    print("-" * 50)
    print()

    return results
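For example, reusing the image_embeddings loaded earlier and querying with the lake photo (the query image itself comes back with a similarity close to 1.0):

search_by_image("temp/lake-smaller.jpg", image_embeddings)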

As the image embeddings and text embeddings are trained to end up in similar locations in the shared multi-dimensional embedding space, either can be used for lookups with good results.

References

Image Classification References

Zero-Shot Image Classification References

CLIP References

SigLIP 2 References

FG-CLIP 2 References