Binary Quantized Embeddings
How to use binary quantized embeddings for semantic text search
Have you ever wanted a search that didn't just match keywords, but searched on the meaning of a word or sentence, returning sentences that mean the same thing even if they're worded slightly differently?
Let's explore how to create a simple retrieval system using binary quantized embeddings for efficient semantic search.
This sort of system is typically used in Retrieval Augmented Generation (RAG), where you retrieve relevant text to pre-prepare a prompt to send to an LLM.
What are we building?
We're going to build a lookup system for RAG that can:
- Convert text into compact binary embeddings
- Store them efficiently in SQLite
- Perform very fast semantic searches
Let's start by understanding the core components we'll be using.
What are embeddings?
Embeddings are numerical representations of text that capture semantic meaning. When we convert text to embeddings, texts with similar meanings end up closer together in the embedding space. That enables semantic search: we find the stored embeddings that are "nearest" to the embedding of the query we're searching with.
# Simple example of generating an embedding
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('mixedbread-ai/mxbai-embed-large-v1')
text = "Hello world"
embedding = model.encode(text) # This creates a numerical representation
print(f"Embedding shape: {embedding.shape}") # Typically (1024,) or similar
While traditional embeddings use floating-point numbers (typically 32-bit), binary quantized embeddings compress each value to just 1 bit, dramatically reducing storage requirements and improving search speed with minimal accuracy loss.
What is Quantization?
Quantization is the act of reducing high-precision float vectors into smaller number types, such as float16, int8, or bit-based formats, so they take up less RAM and are faster to process. By quantizing a float32 array into a float16 array you halve the storage and processing cost, but also halve the precision. And if you quantize each float32 down to a single bit, you reduce the size by 32x, going from 4 KB per 1024-dimension embedding (4 bytes per float) down to 128 bytes, all while our similarity search still returns roughly the same results.
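To make those numbers concrete, here's a quick back-of-the-envelope check in Python (plain numpy arithmetic, nothing specific to any embedding library):
import numpy as np
dims = 1024
float32_size = dims * np.dtype(np.float32).itemsize  # 4 bytes per value = 4096 bytes
float16_size = dims * np.dtype(np.float16).itemsize  # 2 bytes per value = 2048 bytes
binary_size = dims // 8                              # 1 bit per value = 128 bytes
print(f"float32: {float32_size} bytes")
print(f"float16: {float16_size} bytes")
print(f"binary:  {binary_size} bytes ({float32_size // binary_size}x smaller than float32)")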
What is SentenceTransformer?
SentenceTransformer is a Python framework that makes it easy to compute sentence and text embeddings. It's built on top of PyTorch and Transformers and gives us access to hundreds of pre-trained models.
What makes SentenceTransformer particularly useful is how it simplifies the process of generating embeddings with just a few lines of code. The library handles all the complexity of transformer models while giving us easy-to-use interfaces.
# Using SentenceTransformer to compare two sentences
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('mixedbread-ai/mxbai-embed-large-v1')
# Generate embeddings for two sentences
embedding1 = model.encode("I love programming")
embedding2 = model.encode("I enjoy coding")
# Calculate similarity (cosine similarity ranges from -1 to 1)
similarity = util.cos_sim(embedding1, embedding2)
print(f"Similarity: {similarity.item():.4f}") # Should be high (close to 1)
What is SQLite-Vec?
SQLite-Vec is a powerful SQLite extension that adds vector search capabilities to SQLite. It's particularly useful for our search system because it:
- Adds vector search operations directly into SQL
- Supports binary quantized embeddings out of the box
- Works with standard SQLite, which is already included in Python
- Provides fast brute-force nearest neighbor search, with no separate index to manage
SQLite-Vec is perfect for our use case because it combines the simplicity of SQLite with specialized vector operations that would otherwise require separate vector databases.
# Basic example of SQLite-Vec setup
import sqlite3
import sqlite_vec
# Connect to SQLite and load the extension
conn = sqlite3.connect("example.db")
conn.enable_load_extension(True)
sqlite_vec.load(conn)
conn.enable_load_extension(False)
# Create a vector table
conn.execute("""
CREATE VIRTUAL TABLE IF NOT EXISTS vectors
USING vec0(
id INTEGER PRIMARY KEY,
embedding bit[1024] -- Binary vectors with 1024 dimensions
)
""")
Binary quantization: Why compress embeddings?
When working with embeddings, you'll quickly run into storage issues or, more likely, search-speed issues. A standard float32 embedding with 1024 dimensions takes 4 KB per document. For a million documents, that's 4 GB just for the embeddings. And when searching, you have to do a multidimensional vector distance calculation against each of them (though clever indexing can cut down on how many you need to compute).
Binary quantization compresses each number from a 32-bit float to a single bit (0 or 1), resulting in a 32x reduction in size. Here's how we can quantize embeddings:
from sentence_transformers.quantization import quantize_embeddings
# Generate a standard embedding
embedding = model.encode("Binary quantization makes embeddings tiny!", normalize_embeddings=True)
# Quantize to binary (1 bit per dimension)
binary_embedding = quantize_embeddings(embedding, precision="ubinary")
print(f"Original size: {embedding.nbytes} bytes")
print(f"Binary size: {binary_embedding.nbytes} bytes")
print(f"Compression ratio: {embedding.nbytes / binary_embedding.nbytes}x")
While there's a small trade-off in accuracy, the benefits for most applications are enormous:
- 32x smaller storage footprint
- Much faster similarity calculations using bit operations
- Better cache utilization
- Lower memory requirements
And although retrieval accuracy is slightly worse than with the full 32-bit floats, you still retain roughly 96.45% of it.
Hamming distance: Faster binary embedding lookup speed
One of the most impressive advantages of binary embeddings isn't just their size; it's how much faster and more efficient they are to compare. This comes down to a fundamental shift in how we measure similarity.
With traditional float32 embeddings, we typically use cosine similarity, which requires several expensive operations:
- Vector dot products (multiplication operations)
- Vector magnitude calculations (square roots)
- Division operations
In contrast, binary embeddings can use Hamming distance, which simply counts how many bits differ between two vectors. This can be implemented using:
- An XOR operation (extremely fast on modern CPUs)
- A population count (counting the 1s in the result)
These operations are dramatically faster than floating-point math:
# Comparing binary embeddings with Hamming distance
import numpy as np
# Create two binary vectors (already quantized)
binary_vec1 = np.array([0, 1, 1, 0, 1], dtype=np.bool_)
binary_vec2 = np.array([1, 1, 0, 0, 1], dtype=np.bool_)
# Hamming distance calculation (XOR + bit count)
hamming_distance = np.sum(binary_vec1 ^ binary_vec2)
print(f"Hamming distance: {hamming_distance}") # Number of differing bits
SQLite-Vec takes full advantage of this property. When you store binary embeddings using bit[1024] and the vec_quantize_binary() function, SQLite-Vec automatically uses the much faster Hamming distance calculation during searches.
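In practice the binary vectors are stored packed, 8 dimensions per byte, so the XOR and population count run over whole bytes rather than one boolean per dimension. Here's a rough numpy sketch of that packed comparison (sqlite-vec does this in optimised C internally; this is only for illustration):
import numpy as np
# Two 1024-bit vectors packed into 128 bytes each
bits1 = np.random.randint(0, 2, 1024).astype(np.uint8)
bits2 = np.random.randint(0, 2, 1024).astype(np.uint8)
packed1, packed2 = np.packbits(bits1), np.packbits(bits2)
# XOR marks the differing bits; unpacking and summing counts them (popcount)
hamming = int(np.unpackbits(packed1 ^ packed2).sum())
print(f"Hamming distance: {hamming} of 1024 bits differ")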
According to benchmarks from MixedBread.ai and HuggingFace, this approach can yield a 15-45x speedup in retrieval time (with a mean of 25x) compared to traditional float32 embeddings, while retaining over 95% of the retrieval accuracy.
That's why the combination of binary quantization and SQLite-Vec is so powerful for our RAG system—we get both the storage benefits and the computational speedup.
Setup and Installation
Let's get started by installing the required dependencies:
pip install sentence-transformers
pip install sqlite-vec
We'll use the mixedbread-ai/mxbai-embed-large-v1 model, which generates high-quality text embeddings and was designed so that its embeddings can be quantized down to binary while retaining lookup accuracy.
Here's a quick test script to check that the dependencies are working:
# Import needed libraries
from sentence_transformers import SentenceTransformer
import sqlite3
import sqlite_vec
# Load model
model = SentenceTransformer('mixedbread-ai/mxbai-embed-large-v1')
# Test SQLite-Vec
conn = sqlite3.connect(":memory:")
conn.enable_load_extension(True)
sqlite_vec.load(conn)
conn.enable_load_extension(False)
print("Setup complete! Everything is working correctly.")
Creating embeddings with mixedbread-ai/mxbai-embed-large-v1
The mixedbread-ai/mxbai-embed-large-v1 model generates 1024-dimension embeddings and works well with binary quantization.
Example usage:
from sentence_transformers import SentenceTransformer
from sentence_transformers.quantization import quantize_embeddings
# Load the model
model = SentenceTransformer('mixedbread-ai/mxbai-embed-large-v1')
# Generate embeddings for some text
text = "This is an example sentence."
embedding = model.encode(text, normalize_embeddings=True)
# Optionally quantize to binary
binary_embedding = quantize_embeddings(embedding, precision="ubinary")
The normalize_embeddings=True parameter is important for binary quantization, as it ensures the values are properly distributed for optimal quantization.
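Under the hood, binary quantization essentially keeps only the sign of each normalized value: anything above zero becomes a 1 bit, everything else a 0 bit, packed 8 per byte. Here's a minimal sketch of that behaviour (the real quantize_embeddings implementation may differ in its details):
import numpy as np
def binary_quantize(embedding: np.ndarray) -> np.ndarray:
    """Rough equivalent of quantize_embeddings(..., precision="ubinary")."""
    bits = (embedding > 0).astype(np.uint8)  # 1 bit per dimension
    return np.packbits(bits)                 # pack 8 bits into each byte
vec = np.random.randn(1024).astype(np.float32)
print(binary_quantize(vec).shape)  # (128,) uint8 values for 1024 dimensions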
Now let's build a complete RAG system that uses SQLite-Vec for storage and binary quantized embeddings for efficient search:
import os
import sqlite3
import sqlite_vec
import json
from typing import List, Tuple
from sentence_transformers import SentenceTransformer

class RAG:
    """Semantic text search for Retrieval Augmented Generation"""

    def __init__(self, db_path: str = None):
        """Initialize the RAG system.

        Args:
            db_path: Path to the SQLite database file
        """
        self.db_path = os.path.join(os.path.dirname(__file__), "rag.db") if db_path is None else db_path
        self._init_db()
        # Initialize the embedding model
        self.model = SentenceTransformer('mixedbread-ai/mxbai-embed-large-v1')

    def _init_db(self) -> None:
        """Initialize the SQLite database with required tables."""
        conn = sqlite3.connect(self.db_path)
        conn.enable_load_extension(True)
        sqlite_vec.load(conn)
        conn.enable_load_extension(False)
        # Create virtual table for vector storage with binary embeddings
        conn.execute("""
            CREATE VIRTUAL TABLE IF NOT EXISTS segments
            USING vec0(
                id TEXT PRIMARY KEY,   -- Segment ID
                text TEXT,             -- Segment text
                embedding bit[1024],   -- Binary quantized embedding vector
                metadata TEXT          -- Optional metadata in JSON format
            )
        """)
        conn.commit()
        conn.close()

    def _get_db_connection(self) -> sqlite3.Connection:
        """Get a database connection with the sqlite-vec extension loaded."""
        conn = sqlite3.connect(self.db_path)
        conn.enable_load_extension(True)
        sqlite_vec.load(conn)
        conn.enable_load_extension(False)
        return conn

    def load_segment(self, id: str, text: str, metadata: str = "{}") -> None:
        """Load a segment into the database.

        Args:
            id: Unique identifier for the segment
            text: The text content
            metadata: Optional JSON metadata string
        """
        if not text or len(text.strip()) == 0:
            return
        # Generate embedding using SentenceTransformer
        embedding = self.model.encode(text, normalize_embeddings=True)
        # Convert embedding to a JSON string for SQLite-Vec's vec_quantize_binary function
        embedding_json = json.dumps(embedding.tolist())
        # Store in database using vec_quantize_binary
        conn = self._get_db_connection()
        conn.execute("""
            INSERT OR REPLACE INTO segments (id, text, embedding, metadata)
            VALUES (?, ?, vec_quantize_binary(?), ?)
        """, (id, text, embedding_json, metadata))
        conn.commit()
        conn.close()

    def search_segments(self, text: str, limit: int = 10) -> List[Tuple[str, str, float, str]]:
        """Search for segments similar to the given text.

        Args:
            text: The text to search for
            limit: Maximum number of results to return

        Returns:
            List of tuples containing (id, text, similarity_score, metadata)
        """
        # Generate embedding for the query text
        query_embedding = self.model.encode(text, normalize_embeddings=True)
        # Convert the query embedding to a JSON string
        query_embedding_json = json.dumps(query_embedding.tolist())
        # Search the database, quantizing the query with vec_quantize_binary
        conn = self._get_db_connection()
        cursor = conn.execute("""
            SELECT id, text, distance, metadata
            FROM segments
            WHERE embedding MATCH vec_quantize_binary(?) AND k = ?
            ORDER BY distance
        """, (query_embedding_json, limit))
        results = []
        for id, text, distance, metadata in cursor.fetchall():
            # Convert distance to a similarity score (normalize to the [0, 1] range)
            # The distance is a Hamming distance between binary vectors
            similarity = 1.0 - (distance / 1024.0)  # 1024 is our vector dimension
            results.append((id, text, similarity, metadata))
        conn.close()
        return results

    def remove_segment(self, id: str) -> None:
        """Remove a segment from the database.

        Args:
            id: The ID of the segment to remove
        """
        conn = self._get_db_connection()
        conn.execute("DELETE FROM segments WHERE id = ?", (id,))
        conn.commit()
        conn.close()
Using the RAG system
Here's how to use the RAG search:
# Initialize the RAG system
rag = RAG()
# Add some documents
rag.load_segment("doc1", "Tokyo is the capital of Japan.")
rag.load_segment("doc2", "Paris is the capital of France.")
rag.load_segment("doc3", "Berlin is the capital of Germany.")
rag.load_segment("doc4", "Rome is the capital of Italy.")
# Search for similar documents
results = rag.search_segments("What is the capital city of Japan?")
# Print results
for id, text, similarity, metadata in results:
    print(f"{id}: '{text}' (similarity: {similarity:.4f})")
Why binary quantization?
Binary quantization offers several advantages:
- Storage efficiency: Reduces embedding size by 32x compared to float32
- Search speed: Binary operations are extremely fast on modern CPUs
- Memory efficiency: Allows storing millions of embeddings in memory
- Simplicity: XOR and bit-count operations are simple and widely supported
For most applications, the small accuracy trade-off is well worth these benefits. When working with large document collections, binary quantized embeddings make semantic search practical and cost-effective.
Conclusion
We've built a simple but powerful RAG system using binary quantized embeddings. This approach makes semantic search efficient and scalable even with limited resources.
References:
- Binary vector embeddings are so cool
- Binary and Scalar Embedding Quantization for Significantly Faster & Cheaper Retrieval
- Introduction to Matryoshka Embedding Models
- mxbai-embed-large-v1 embedding model
- sqlite-vec: GitHub
- I'm writing a new vector search SQLite Extension
- Vector search in 7 different programming languages using SQL
- Introducing sqlite-vec v0.1.0: a vector search SQLite extension that runs everywhere
- Hybrid full-text search and vector search with SQLite
- sqlite-vec: A vector search SQLite extension that runs anywhere!
- SentenceTransformers Documentation
- Sentence Transformers: Embeddings, Retrieval, and Reranking
Example Code
I've added the above example code to GitHub here, with an extra command-line interface that lets you load in text and query the DB from the command line.