Binary Quantized Embeddings
How to use binary quantized embeddings for semantic text search
Have you ever wanted a search that didn't just match keywords, but searched on the meaning of a word or sentence, returning sentences that mean the same thing even if they're worded slightly differently?
Let's explore how to create a simple retrieval system using binary quantized embeddings for efficient semantic search.
This sort of system is typically used in Retrieval Augmented Generation (RAG), where you retrieve relevant text to pre-prepare a prompt to send to an LLM.
What are we building?
We're going to build a lookup system for RAG that can:
- Convert text into compact binary embeddings
- Store them efficiently in SQLite
- Perform very fast semantic searches
Let's start by understanding the core components we'll be using.
What are embeddings?
Embeddings are numerical representations of text that capture semantic meaning. When we convert text to embeddings, texts with similar meanings end up closer together in the embedding space. That enables semantic search: we find the stored embeddings that are "nearest" to the embedding of the query we're searching with.
# Simple example of generating an embedding
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('mixedbread-ai/mxbai-embed-large-v1')
text = "Hello world"
embedding = model.encode(text) # This creates a numerical representation
print(f"Embedding shape: {embedding.shape}") # Typically (1024,) or similar
While traditional embeddings use floating-point numbers (typically 32-bit), binary quantized embeddings compress each value to just 1 bit, dramatically reducing storage requirements and improving search speed with minimal accuracy loss.
What is Quantization?
Quantization is the act of reducing high-precision float vectors into smaller number types, such as float16, int8, or bit-based formats, so they take up less RAM and are faster to process. By quantizing a float32 array into a float16 array you halve the storage and processing cost, but also halve the precision. And if you quantize each float32 down to a single bit, you reduce the size by 32x, going from 4 KB per 1024-dimension embedding (4 bytes per float) down to 128 bytes, all while our similarity search still returns roughly the same results.
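To make those numbers concrete, here's a quick back-of-the-envelope check in Python (plain numpy arithmetic, nothing specific to any embedding library):
import numpy as np
dims = 1024
float32_size = dims * np.dtype(np.float32).itemsize  # 4 bytes per value = 4096 bytes
float16_size = dims * np.dtype(np.float16).itemsize  # 2 bytes per value = 2048 bytes
binary_size = dims // 8                              # 1 bit per value = 128 bytes
print(f"float32: {float32_size} bytes")
print(f"float16: {float16_size} bytes")
print(f"binary:  {binary_size} bytes ({float32_size // binary_size}x smaller than float32)")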
What is SentenceTransformer?
SentenceTransformer is a Python framework that makes it easy to compute sentence and text embeddings. It's built on top of PyTorch and Transformers and gives us access to hundreds of pre-trained models.
What makes SentenceTransformer particularly useful is how it simplifies the process of generating embeddings with just a few lines of code. The library handles all the complexity of transformer models while giving us easy-to-use interfaces.
# Using SentenceTransformer to compare two sentences
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('mixedbread-ai/mxbai-embed-large-v1')
# Generate embeddings for two sentences
embedding1 = model.encode("I love programming")
embedding2 = model.encode("I enjoy coding")
# Calculate similarity (cosine similarity ranges from -1 to 1)
similarity = util.cos_sim(embedding1, embedding2)
print(f"Similarity: {similarity.item():.4f}") # Should be high (close to 1)
What is SQLite-Vec?
SQLite-Vec is a powerful SQLite extension that adds vector search capabilities to SQLite. It's particularly useful for our search system because it:
- Adds vector search operations directly into SQL
- Supports binary quantized embeddings out of the box
- Works with standard SQLite, which is already included in Python
- Provides fast brute-force nearest neighbor search, with no separate index to manage
SQLite-Vec is perfect for our use case because it combines the simplicity of SQLite with specialized vector operations that would otherwise require separate vector databases.
# Basic example of SQLite-Vec setup
import sqlite3
import sqlite_vec
# Connect to SQLite and load the extension
conn = sqlite3.connect("example.db")
conn.enable_load_extension(True)
sqlite_vec.load(conn)
conn.enable_load_extension(False)
# Create a vector table
conn.execute("""
CREATE VIRTUAL TABLE IF NOT EXISTS vectors
USING vec0(
id INTEGER PRIMARY KEY,
embedding bit[1024] -- Binary vectors with 1024 dimensions
)
""")
Binary quantization: Why compress embeddings?
When working with embeddings, you'll quickly run into storage issues or, more likely, search-speed issues. A standard float32 embedding with 1024 dimensions takes 4 KB per document. For a million documents, that's 4 GB just for the embeddings. And when searching, you have to do a multidimensional vector distance calculation against each of them (though clever indexing can cut down on how many you need to compute).
Binary quantization compresses each number from a 32-bit float to a single bit (0 or 1), resulting in a 32x reduction in size. Here's how we can quantize embeddings:
from sentence_transformers.quantization import quantize_embeddings
# Generate a standard embedding
embedding = model.encode("Binary quantization makes embeddings tiny!", normalize_embeddings=True)
# Quantize to binary (1 bit per dimension)
binary_embedding = quantize_embeddings(embedding, precision="ubinary")
print(f"Original size: {embedding.nbytes} bytes")
print(f"Binary size: {binary_embedding.nbytes} bytes")
print(f"Compression ratio: {embedding.nbytes / binary_embedding.nbytes}x")
While there's a small trade-off in accuracy, the benefits for most applications are enormous:
- 32x smaller storage footprint
- Much faster similarity calculations using bit operations
- Better cache utilization
- Lower memory requirements
And although retrieval accuracy is slightly worse than with the full 32-bit floats, you still retain roughly 96.45% of it.
Hamming distance: Faster binary embedding lookup speed
One of the most impressive advantages of binary embeddings isn't just their size; it's how much faster and more efficient they are to compare. This comes down to a fundamental shift in how we measure similarity.
With traditional float32 embeddings, we typically use cosine similarity, which requires several expensive operations:
- Vector dot products (multiplication operations)
- Vector magnitude calculations (square roots)
- Division operations
In contrast, binary embeddings can use Hamming distance, which simply counts how many bits differ between two vectors. This can be implemented using:
- An XOR operation (extremely fast on modern CPUs)
- A population count (counting the 1s in the result)
These operations are dramatically faster than floating-point math:
# Comparing binary embeddings with Hamming distance
import numpy as np
# Create two binary vectors (already quantized)
binary_vec1 = np.array([0, 1, 1, 0, 1], dtype=np.bool_)
binary_vec2 = np.array([1, 1, 0, 0, 1], dtype=np.bool_)
# Hamming distance calculation (XOR + bit count)
hamming_distance = np.sum(binary_vec1 ^ binary_vec2)
print(f"Hamming distance: {hamming_distance}") # Number of differing bits
SQLite-Vec takes full advantage of this property. When you store binary embeddings using bit[1024] and the vec_quantize_binary() function, SQLite-Vec automatically uses the much faster Hamming distance calculation during searches.
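In practice the binary vectors are stored packed, 8 dimensions per byte, so the XOR and population count run over whole bytes rather than one boolean per dimension. Here's a rough numpy sketch of that packed comparison (sqlite-vec does this in optimised C internally; this is only for illustration):
import numpy as np
# Two 1024-bit vectors packed into 128 bytes each
bits1 = np.random.randint(0, 2, 1024).astype(np.uint8)
bits2 = np.random.randint(0, 2, 1024).astype(np.uint8)
packed1, packed2 = np.packbits(bits1), np.packbits(bits2)
# XOR marks the differing bits; unpacking and summing counts them (popcount)
hamming = int(np.unpackbits(packed1 ^ packed2).sum())
print(f"Hamming distance: {hamming} of 1024 bits differ")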
According to benchmarks from MixedBread.ai and HuggingFace, this approach can yield a 15-45x speedup in retrieval time (with a mean of 25x) compared to traditional float32 embeddings, while retaining over 95% of the retrieval accuracy.
That's why the combination of binary quantization and SQLite-Vec is so powerful for our RAG system—we get both the storage benefits and the computational speedup.
Setup and Installation
Let's get started by installing the required dependencies:
pip install sentence-transformers
pip install sqlite-vec
We'll use the mixedbread-ai/mxbai-embed-large-v1 model, which generates high-quality text embeddings and was designed so that its embeddings can be quantized down to binary while retaining lookup accuracy.
Here's a quick test script to check that the dependencies are working:
# Import needed libraries
from sentence_transformers import SentenceTransformer
import sqlite3
import sqlite_vec
# Load model
model = SentenceTransformer('mixedbread-ai/mxbai-embed-large-v1')
# Test SQLite-Vec
conn = sqlite3.connect(":memory:")
conn.enable_load_extension(True)
sqlite_vec.load(conn)
conn.enable_load_extension(False)
print("Setup complete! Everything is working correctly.")
Creating embeddings with mixedbread-ai/mxbai-embed-large-v1
The mixedbread-ai/mxbai-embed-large-v1 model generates 1024-dimension embeddings and works well with binary quantization.
Example usage:
from sentence_transformers import SentenceTransformer
from sentence_transformers.quantization import quantize_embeddings
# Load the model
model = SentenceTransformer('mixedbread-ai/mxbai-embed-large-v1')
# Generate embeddings for some text
text = "This is an example sentence."
embedding = model.encode(text, normalize_embeddings=True)
# Optionally quantize to binary
binary_embedding = quantize_embeddings(embedding, precision="ubinary")
The normalize_embeddings=True parameter is important for binary quantization, as it ensures the values are properly distributed for optimal quantization.
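Under the hood, binary quantization essentially keeps only the sign of each normalized value: anything above zero becomes a 1 bit, everything else a 0 bit, packed 8 per byte. Here's a minimal sketch of that behaviour (the real quantize_embeddings implementation may differ in its details):
import numpy as np
def binary_quantize(embedding: np.ndarray) -> np.ndarray:
    """Rough equivalent of quantize_embeddings(..., precision="ubinary")."""
    bits = (embedding > 0).astype(np.uint8)  # 1 bit per dimension
    return np.packbits(bits)                 # pack 8 bits into each byte
vec = np.random.randn(1024).astype(np.float32)
print(binary_quantize(vec).shape)  # (128,) uint8 values for 1024 dimensions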
Now let's build a complete RAG system that uses SQLite-Vec for storage and binary quantized embeddings for efficient search:
import os
import sqlite3
import sqlite_vec
import json
from typing import List, Tuple
from sentence_transformers import SentenceTransformer

class RAG:
    """Semantic text search for Retrieval Augmented Generation"""

    def __init__(self, db_path: str = None):
        """Initialize the RAG system.

        Args:
            db_path: Path to the SQLite database file
        """
        self.db_path = os.path.join(os.path.dirname(__file__), "rag.db") if db_path is None else db_path
        self._init_db()
        # Initialize the embedding model
        self.model = SentenceTransformer('mixedbread-ai/mxbai-embed-large-v1')

    def _init_db(self) -> None:
        """Initialize the SQLite database with required tables."""
        conn = sqlite3.connect(self.db_path)
        conn.enable_load_extension(True)
        sqlite_vec.load(conn)
        conn.enable_load_extension(False)
        # Create virtual table for vector storage with binary embeddings
        conn.execute("""
            CREATE VIRTUAL TABLE IF NOT EXISTS segments
            USING vec0(
                id TEXT PRIMARY KEY,   -- Segment ID
                text TEXT,             -- Segment text
                embedding bit[1024],   -- Binary quantized embedding vector
                metadata TEXT          -- Optional metadata in JSON format
            )
        """)
        conn.commit()
        conn.close()

    def _get_db_connection(self) -> sqlite3.Connection:
        """Get a database connection with the sqlite-vec extension loaded."""
        conn = sqlite3.connect(self.db_path)
        conn.enable_load_extension(True)
        sqlite_vec.load(conn)
        conn.enable_load_extension(False)
        return conn

    def load_segment(self, id: str, text: str, metadata: str = "{}") -> None:
        """Load a segment into the database.

        Args:
            id: Unique identifier for the segment
            text: The text content
            metadata: Optional JSON metadata string
        """
        if not text or len(text.strip()) == 0:
            return
        # Generate embedding using SentenceTransformer
        embedding = self.model.encode(text, normalize_embeddings=True)
        # Convert embedding to a JSON string for SQLite-Vec's vec_quantize_binary function
        embedding_json = json.dumps(embedding.tolist())
        # Store in database using vec_quantize_binary
        conn = self._get_db_connection()
        conn.execute("""
            INSERT OR REPLACE INTO segments (id, text, embedding, metadata)
            VALUES (?, ?, vec_quantize_binary(?), ?)
        """, (id, text, embedding_json, metadata))
        conn.commit()
        conn.close()

    def search_segments(self, text: str, limit: int = 10) -> List[Tuple[str, str, float, str]]:
        """Search for segments similar to the given text.

        Args:
            text: The text to search for
            limit: Maximum number of results to return

        Returns:
            List of tuples containing (id, text, similarity_score, metadata)
        """
        # Generate embedding for the query text
        query_embedding = self.model.encode(text, normalize_embeddings=True)
        # Convert the query embedding to a JSON string
        query_embedding_json = json.dumps(query_embedding.tolist())
        # Search the database, quantizing the query with vec_quantize_binary
        conn = self._get_db_connection()
        cursor = conn.execute("""
            SELECT id, text, distance, metadata
            FROM segments
            WHERE embedding MATCH vec_quantize_binary(?) AND k = ?
            ORDER BY distance
        """, (query_embedding_json, limit))
        results = []
        for id, text, distance, metadata in cursor.fetchall():
            # Convert distance to a similarity score (normalize to the [0, 1] range)
            # The distance is a Hamming distance between binary vectors
            similarity = 1.0 - (distance / 1024.0)  # 1024 is our vector dimension
            results.append((id, text, similarity, metadata))
        conn.close()
        return results

    def remove_segment(self, id: str) -> None:
        """Remove a segment from the database.

        Args:
            id: The ID of the segment to remove
        """
        conn = self._get_db_connection()
        conn.execute("DELETE FROM segments WHERE id = ?", (id,))
        conn.commit()
        conn.close()
Using the RAG system
Here's how to use the RAG search:
# Initialize the RAG system
rag = RAG()
# Add some documents
rag.load_segment("doc1", "Tokyo is the capital of Japan.")
rag.load_segment("doc2", "Paris is the capital of France.")
rag.load_segment("doc3", "Berlin is the capital of Germany.")
rag.load_segment("doc4", "Rome is the capital of Italy.")
# Search for similar documents
results = rag.search_segments("What is the capital city of Japan?")
# Print results
for id, text, similarity, metadata in results:
    print(f"{id}: '{text}' (similarity: {similarity:.4f})")
Why binary quantization?
Binary quantization offers several advantages:
- Storage efficiency: Reduces embedding size by 32x compared to float32
- Search speed: Binary operations are extremely fast on modern CPUs
- Memory efficiency: Allows storing millions of embeddings in memory
- Simplicity: XOR and bit-count operations are simple and widely supported
For most applications, the small accuracy trade-off is well worth these benefits. When working with large document collections, binary quantized embeddings make semantic search practical and cost-effective.
Conclusion
We've built a simple but powerful RAG system using binary quantized embeddings. This approach makes semantic search efficient and scalable even with limited resources.
References:
- Binary vector embeddings are so cool
- Binary and Scalar Embedding Quantization for Significantly Faster & Cheaper Retrieval
- Introduction to Matryoshka Embedding Models
- mxbai-embed-large-v1 embedding model
- sqlite-vec: GitHub
- I'm writing a new vector search SQLite Extension
- Vector search in 7 different programming languages using SQL
- Introducing sqlite-vec v0.1.0: a vector search SQLite extension that runs everywhere
- Hybrid full-text search and vector search with SQLite
- sqlite-vec: A vector search SQLite extension that runs anywhere!
- SentenceTransformers Documentation
- Sentence Transformers: Embeddings, Retrieval, and Reranking
Example Code
I've added the above example code to GitHub here, with an extra command-line interface that lets you load in text and query the DB from the command line.