Content-based image retrieval with Computer Vision

Modern CBIR: from pixel similarity to embedding-space ANN search with FAISS and HNSW. Covers embedding choice, recall versus latency, and production architecture.

Written by TechnoLynx Published on 26 May 2025

Introduction

Content-based image retrieval (CBIR) is one of the parts of computer vision that has been quietly re-architected by the deep-learning era, and most legacy documentation has not caught up. The pre-2018 stack (SIFT, SURF, colour histograms, bag-of-visual-words) has been replaced by embedding-model encoding plus approximate nearest-neighbour (ANN) search over the embedding space. The new stack is 10–50× faster at higher relevance for most production use cases. This article walks through the modern CBIR architecture, the embedding-model choice, the index/recall trade-off, and the operational metric that matters (recall@k at p99 latency). See computer vision for the broader practice.

The naive read is “CBIR means SIFT-style feature matching.” The expert read is that CBIR in 2026 is embedding search — and the work is choosing the right embedding model, the right ANN index, and the right operational metric for the use case.

What this means in practice

  • Embedding choice (general CLIP-class vs domain-fine-tuned) drives both relevance and latency more than any other decision.
  • ANN library and index choice (libraries such as FAISS and ScaNN; index types such as HNSW and IVF-PQ) decides the recall/latency/memory trade-off.
  • The operational metric is recall@k at p99 latency, not raw accuracy on a benchmark.
  • Production CBIR systems need re-indexing infrastructure, not just initial indexing.
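To make the "recall@k at p99 latency" metric from the list above concrete, here is a minimal sketch in plain Python. The evaluation log, the ids, and the latency values are hypothetical, and the percentile uses the simple nearest-rank definition; a production monitor would likely use a streaming estimator instead.

```python
import math

def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant ids that appear in the top-k retrieved ids."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("relevant set must not be empty")
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def percentile(values, pct):
    """Nearest-rank percentile (pct between 0 and 100) of a non-empty list."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based nearest rank
    return ordered[max(rank, 1) - 1]

# Hypothetical evaluation log: (retrieved ids, ground-truth ids, latency in ms).
queries = [
    (["a", "b", "c", "d"], {"a", "c"}, 4.2),
    (["e", "f", "g", "h"], {"f", "z"}, 6.8),
    (["i", "j", "k", "l"], {"l"}, 31.0),
]

mean_recall = sum(recall_at_k(r, rel, k=3) for r, rel, _ in queries) / len(queries)
p99_latency_ms = percentile([ms for _, _, ms in queries], 99)
```

Reporting the two numbers together is the point: a configuration that improves mean recall while pushing the p99 latency past the budget is not an improvement for users.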

The 2010s pipeline was: extract handcrafted descriptors (SIFT, SURF, ORB) from each image, aggregate into bag-of-visual-words representations, compute distance in the BoVW space, and return nearest matches. The 2026 pipeline is: encode each image through a deep embedding model (CLIP, DINOv2, domain-fine-tuned ResNet/ViT) into a fixed-dimensional vector (typically 512 to 2048 dimensions), index the vectors in an ANN structure, and query by encoding the query image and retrieving nearest vectors.
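The encode, index, query flow described above can be sketched in a few lines. This is an illustration under stated assumptions, not a production design: the 3-dimensional vectors stand in for real encoder output, the file names are made up, and `build_index` and `search` are hypothetical helpers that perform a flat exhaustive scan rather than an ANN lookup.

```python
import math

def l2_normalise(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in vec]

def build_index(embeddings):
    """Store unit-length vectors keyed by image id (a flat, exhaustive index)."""
    return {image_id: l2_normalise(vec) for image_id, vec in embeddings.items()}

def search(index, query_vec, k):
    """Return the k (image_id, cosine similarity) pairs closest to the query."""
    q = l2_normalise(query_vec)
    scored = [(image_id, sum(a * b for a, b in zip(q, vec)))
              for image_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy 3-d vectors standing in for encoder output (real ones have hundreds of dims).
catalogue = {
    "shoe_front.jpg": [0.9, 0.1, 0.0],
    "shoe_side.jpg": [0.8, 0.3, 0.1],
    "teapot.jpg": [0.0, 0.2, 0.9],
}
index = build_index(catalogue)
results = search(index, query_vec=[1.0, 0.2, 0.0], k=2)
```

In a real system the dictionary would be replaced by an ANN structure and the toy vectors by the output of the chosen embedding model; the shape of the flow stays the same.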

The shift is not incremental. Embedding-based retrieval captures semantic similarity: two photos of the same product from different angles sit close together in embedding space, whereas handcrafted descriptors would often fail to match them. It scales to billions of images via ANN indices that handle nearest-neighbour queries in milliseconds. And it integrates naturally with text queries (CLIP-class models map text and images into the same space), enabling cross-modal retrieval that the handcrafted-feature approach cannot support at all.

What is the architectural difference between classical CBIR and text-based image retrieval today?

Classical CBIR matches images to images by visual content. Text-based image retrieval matches text queries to images by semantic content — and in the modern stack, both use the same embedding model and the same index. CLIP-class models (CLIP, OpenCLIP, SigLIP, BLIP-2) encode text and images into a shared embedding space; the same FAISS or HNSW index serves both query types.

Architecturally, the difference collapses to the query side: a CBIR query encodes an image and retrieves nearest images; a text query encodes text and retrieves nearest images. Hybrid retrieval — combining image and text queries — uses both encodings and merges the result lists (or sums the vectors in the shared space). The unified architecture is the reason production CBIR systems built in 2023–2026 typically support text queries by default; the marginal cost is small once the embedding stack is in place.
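The two hybrid strategies mentioned above (summing the vectors, or merging the result lists) can be sketched as follows. Both functions are illustrative: `fuse_query_vectors` assumes the image and text vectors come from the same shared-space model, and reciprocal rank fusion is one common merging rule chosen here as an assumption, not something the article prescribes. The ids and the constant `k=60` are likewise illustrative.

```python
import math

def l2_normalise(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in vec]

def fuse_query_vectors(image_vec, text_vec, text_weight=0.5):
    """Weighted sum of two unit vectors from one shared space, re-normalised."""
    a, b = l2_normalise(image_vec), l2_normalise(text_vec)
    mixed = [(1 - text_weight) * x + text_weight * y for x, y in zip(a, b)]
    return l2_normalise(mixed)

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked id lists: each id scores sum(1 / (k + rank)) over the lists."""
    scores = {}
    for ranked in ranked_lists:
        for rank, item in enumerate(ranked, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))

image_hits = ["img_7", "img_2", "img_9"]   # hypothetical image-query results
text_hits = ["img_2", "img_5", "img_7"]    # hypothetical text-query results
merged = reciprocal_rank_fusion([image_hits, text_hits])
fused_query = fuse_query_vectors([1.0, 0.0], [0.0, 1.0])
```

Vector fusion needs only one index lookup per query, while list merging needs two lookups but works even when the two paths use different models.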

Where is CBIR used in production?

E-commerce visual search is the most public application: shoppers photograph a product or upload an image and the system retrieves similar products. The relevance bar is high because misranking erodes trust quickly. Media archives use CBIR for asset discovery — broadcast and stock-photo companies maintain billion-image archives where filename search is hopeless and editorial metadata is partial.

Medical imaging uses CBIR for case-similarity retrieval — given a radiology study, retrieve historically similar studies with confirmed diagnoses. The relevance and provenance bars are extreme. Security and surveillance use CBIR for incident reconstruction (given a frame of interest, retrieve historically similar frames) and for identity matching across cameras. Industrial inspection uses CBIR for defect-similarity retrieval — given a flagged part, retrieve historically similar flagged parts to refine the classification.

How do deep-learning embeddings and ANN indexes change retrieval latency and recall?

Deep embeddings reduce the per-image retrieval work to a vector distance computation. Modern ANN indices then exploit the geometric structure of the embedding space to avoid exhaustive search. HNSW (Hierarchical Navigable Small World graphs) typically achieves recall above 0.95 at under 10 ms latency over hundreds of millions of vectors, provided the full-precision vectors and the graph links fit in memory. IVF-PQ (Inverted File with Product Quantisation) trades some recall for a much lower memory footprint, which suits billion-scale indices where memory is the binding constraint. ScaNN (Google’s library) offers recall and latency competitive with HNSW implementations, with different memory characteristics.
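A back-of-envelope calculation shows why memory becomes the binding constraint. The sketch below assumes float32 vectors, an HNSW graph with `m` links per node (using the commonly quoted per-vector rule of thumb of `dim * 4 + m * 2 * 4` bytes, treated here as an approximation), and product-quantised codes of a fixed byte length. It ignores ids, codebooks, and allocator overhead, and the corpus size and dimensions are hypothetical.

```python
GIB = 1024 ** 3

def flat_bytes(n, dim):
    """Uncompressed float32 vectors: 4 bytes per dimension."""
    return n * dim * 4

def hnsw_bytes(n, dim, m=32):
    """Full vectors plus roughly 2 * m four-byte graph links per vector."""
    return n * (dim * 4 + m * 2 * 4)

def pq_bytes(n, code_bytes=64):
    """Product-quantised codes only (codebooks and ids ignored)."""
    return n * code_bytes

n, dim = 100_000_000, 512  # hypothetical corpus: 100M images, 512-d embeddings
estimates_gib = {
    "flat": flat_bytes(n, dim) / GIB,
    "hnsw": hnsw_bytes(n, dim) / GIB,
    "pq64": pq_bytes(n) / GIB,
}
```

Under these assumptions the flat and HNSW layouts land in the region of 200 GiB, while 64-byte codes need roughly 6 GiB, which is the gap the recall trade-off buys.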

The recall/latency/memory trade-off is the design space: HNSW is fast and high-recall but memory-heavy; IVF-PQ is memory-efficient but gives up some recall; flat exhaustive search is exact, so it has the highest recall, but does not scale. Production deployments typically use HNSW for hot-path retrieval, followed by an exact (flat) re-rank of the top-N candidates for the final result, achieving both speed and accuracy.
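The retrieve-then-re-rank pattern can be illustrated with a toy example. Note the substitution: the first stage below is a crude scalar quantiser standing in for whatever approximate index is deployed, not an HNSW implementation. The vectors, the query, and the `two_stage_search` helper are all hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def quantise(vec, step=0.5):
    """Crude scalar quantiser standing in for a lossy approximate index."""
    return [round(x / step) * step for x in vec]

def two_stage_search(vectors, query, k, n_candidates):
    """Shortlist with approximate scores, then re-rank the shortlist exactly."""
    # Stage 1: cheap approximate scoring (the lossy copies would be prebuilt).
    coarse = {i: quantise(v) for i, v in vectors.items()}
    shortlist = sorted(coarse, key=lambda i: dot(query, coarse[i]),
                       reverse=True)[:n_candidates]
    # Stage 2: exact scoring on full-precision vectors, shortlist only.
    return sorted(shortlist, key=lambda i: dot(query, vectors[i]),
                  reverse=True)[:k]

vectors = {
    "a": [0.9, 0.2],
    "b": [0.8, 0.6],
    "c": [0.1, 0.9],
    "d": [0.7, 0.7],
}
query = [0.6, 0.8]

top2 = two_stage_search(vectors, query, k=2, n_candidates=3)
top2_small_shortlist = two_stage_search(vectors, query, k=2, n_candidates=2)
```

With a shortlist of three, the exact re-rank recovers the true top two; with a shortlist of two, the best match never reaches the second stage. The shortlist size is therefore a recall knob in its own right.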

What are the trade-offs between content-based and content-plus-text retrieval (CLIP-style)?

Content-only retrieval (image-encoder-only, e.g. DINOv2-style) often beats CLIP-class models for pure visual similarity — fine-grained product matches, near-duplicate detection, visual style matches. The image encoder is optimised for visual representation without the text-alignment trade-off.

CLIP-class retrieval (joint image+text models) enables text queries against the image index, which is operationally valuable but trades some visual-only relevance for the cross-modal capability. For the e-commerce visual-search case where users also type queries, the trade is worthwhile; for the pure visual-similarity case (defect matching, near-duplicate detection), the content-only model often wins. The pragmatic deployment uses both: a CLIP-class model for the text query path and a domain-fine-tuned content-only model for the visual query path, indexed separately.
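The "use both, indexed separately" deployment amounts to a small routing layer in front of two retrievers. The sketch below is hypothetical throughout: the `Retriever` class, the `route` function, the lambda encoders, and the two-dimensional vectors are placeholders for a CLIP-class text path and a content-only visual path.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]

def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))

@dataclass
class Retriever:
    encode: Callable[[object], Vector]  # hypothetical encoder for this path
    index: Dict[str, Vector]            # this path's own embeddings

    def search(self, query: object, k: int) -> List[str]:
        q = self.encode(query)
        ranked = sorted(self.index, key=lambda i: dot(q, self.index[i]),
                        reverse=True)
        return ranked[:k]

def route(query: object, text_path: Retriever, visual_path: Retriever,
          k: int = 5) -> List[str]:
    """Strings go to the text path; anything else goes to the visual path."""
    path = text_path if isinstance(query, str) else visual_path
    return path.search(query, k)

# Toy stand-ins: a lookup table as the text encoder, identity as the image encoder.
text_path = Retriever(encode=lambda text: {"red shoe": [1.0, 0.0]}[text],
                      index={"p1": [0.9, 0.1], "p2": [0.1, 0.9]})
visual_path = Retriever(encode=lambda pixels: pixels,
                        index={"p1": [0.2, 0.8], "p2": [0.7, 0.3]})

text_result = route("red shoe", text_path, visual_path, k=1)
image_result = route([1.0, 0.0], text_path, visual_path, k=1)
```

Because the two paths embed the same catalogue with different models, each image is stored twice, which is the cost of this arrangement.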

Where does content-based video retrieval extend or break the CBIR patterns used for images?

Video retrieval extends CBIR by treating videos as sequences of frame embeddings — temporal aggregation (averaging, attention-pooling, transformer-encoding the frame sequence) produces a clip-level embedding that can be indexed and queried with the same ANN infrastructure. Many practical video-retrieval systems work this way: sample one frame per second, embed each, aggregate, index.
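The simplest of the aggregation options named above, averaging, can be sketched as below. The frame vectors are toy values, and `sample_timestamps` and `mean_pool` are hypothetical helpers; attention pooling or a transformer over the frame sequence would replace `mean_pool` in a stronger system.

```python
import math

def sample_timestamps(duration_s, fps=1.0):
    """Timestamps in seconds at which frames would be grabbed."""
    count = int(duration_s * fps)
    return [i / fps for i in range(count)]

def mean_pool(frame_embeddings):
    """Average per-frame vectors into one unit-length clip-level vector."""
    if not frame_embeddings:
        raise ValueError("need at least one frame embedding")
    dim = len(frame_embeddings[0])
    if any(len(f) != dim for f in frame_embeddings):
        raise ValueError("all frame embeddings must share one dimension")
    n = len(frame_embeddings)
    mean = [sum(f[i] for f in frame_embeddings) / n for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean] if norm else mean

timestamps = sample_timestamps(duration_s=3, fps=1.0)
clip_vector = mean_pool([[1.0, 0.0], [0.0, 1.0]])  # toy 2-d frame embeddings
```

Averaging discards frame order, so two clips showing the same content in a different sequence get the same clip-level vector.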

Video breaks the CBIR pattern when the query involves temporal structure that frame-level embeddings cannot capture: “find videos where a person walks left-to-right then sits down” is a temporal-pattern query that needs a video-native model. Storage is also an issue: a one-hour video sampled at one frame per second produces 3,600 frame embeddings, so a per-frame index grows quickly. Cross-modal video search (text-to-video) is improving but lags text-to-image; the joint-embedding training data is smaller and the temporal alignment makes the supervision harder.
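The storage point is easy to quantify. The arithmetic below assumes 512-dimensional float32 embeddings and a hypothetical 10,000-hour archive; it counts raw vector bytes only, with no index overhead.

```python
def frame_count(hours, fps=1.0):
    """Number of sampled frames for a given duration and sampling rate."""
    return int(hours * 3600 * fps)

def index_bytes(n_vectors, dim=512, bytes_per_value=4):
    """Raw storage for n float32 vectors of the given dimension."""
    return n_vectors * dim * bytes_per_value

one_hour = index_bytes(frame_count(1))           # 3,600 vectors
archive = index_bytes(frame_count(10_000))       # hypothetical 10,000-hour archive
pooled = index_bytes(frame_count(10_000) // 10)  # one vector per 10-second clip
```

One hour comes to about 7 MiB of raw vectors, the full archive to roughly 69 GiB, and pooling to one vector per 10-second clip cuts that tenfold at the cost of frame-level granularity.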

How TechnoLynx Can Help

TechnoLynx builds production CBIR systems from the embedding-model choice through ANN index selection, re-indexing infrastructure, and the operational monitoring (recall@k at p99) that decides whether the system actually serves users well. If you have a CBIR use case — e-commerce visual search, archive discovery, medical case retrieval — contact us for an architecture review.

Image credits: Freepik
