What Is a Vector Database?
A vector database is a specialized data store designed to index, store, and query high-dimensional numeric vectors — commonly called embeddings — and perform similarity searches over them at scale. Where a relational database answers "find rows where column X equals Y," a vector database answers "find items most similar to this query," measured by distance in a high-dimensional space.
Vector databases have become a critical component of enterprise AI infrastructure because they are the engine behind retrieval-augmented generation (RAG), semantic search, recommendation systems, and any AI application that needs to find "things that mean the same" rather than "things that match exactly." Their adoption has grown sharply since 2023 as LLM-based applications moved from prototype to production.
In short: a vector database stores numeric embeddings of text, images, or structured data and retrieves them by semantic similarity rather than exact match, making it the retrieval engine in most RAG architectures. The business challenge: vector indexes are opaque — governance requires tracking what was indexed, when, by whom, and from what source to ensure AI retrieval stays accurate and auditable.
Vectors and Embeddings
An embedding is a numeric representation of a piece of information — a sentence, a document, an image, a data record — produced by a machine learning model. The model maps the input into a high-dimensional space (typically 768–4096 dimensions) such that items with similar meaning are placed near each other in that space.
For example, a text embedding model converts "What is data quality?" and "How is data quality measured?" into vectors that are geometrically close — even though they share few literal words. A document about the GDPR and a document about EU data privacy law will have similar embeddings. A document about basketball will be far away from both.
This geometric structure is what enables semantic search: find the vectors (and their corresponding documents) closest to the query vector. Closeness is typically measured by cosine similarity (the cosine of the angle between two vectors) or Euclidean distance.
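As a minimal sketch, cosine similarity can be computed directly. The three-dimensional vectors below are toy stand-ins invented for illustration; real embeddings have hundreds or thousands of dimensions:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (real models emit 768-4096 dimensions).
data_quality_1 = [0.9, 0.1, 0.0]  # "What is data quality?"
data_quality_2 = [0.8, 0.2, 0.1]  # "How is data quality measured?"
basketball     = [0.0, 0.1, 0.9]  # unrelated topic

print(cosine_similarity(data_quality_1, data_quality_2))  # close to 1.0
print(cosine_similarity(data_quality_1, basketball))      # close to 0.0
```

The two questions about data quality score near 1.0 despite sharing few words with each other's phrasing, while the unrelated vector scores near 0.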
How Vector Databases Work
Naively searching for the nearest vector among millions of documents requires comparing the query to every stored vector — too slow for production. Vector databases solve this with approximate nearest neighbor (ANN) indexing algorithms, which trade a small loss in recall for orders-of-magnitude faster search.
Key Indexing Algorithms
- HNSW (Hierarchical Navigable Small World) — Graph-based index that supports high-accuracy search with low latency. Default choice for most use cases. Memory-intensive but fast.
- IVF (Inverted File Index) — Clusters vectors and searches only the relevant clusters. Better for very large datasets where HNSW memory requirements become prohibitive.
- PQ (Product Quantization) — Compresses vectors to reduce memory usage at some accuracy cost. Used in combination with IVF for billion-scale indexes.
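To make the trade-off concrete, the IVF idea can be sketched in a few lines: cluster the vectors with k-means, then at query time probe only the closest cluster(s) instead of scanning everything. This is an illustrative toy under simplified assumptions, not how a production library implements it:

```python
import math
import random

def l2(a: list[float], b: list[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_ivf(vectors, n_clusters=2, iters=10):
    """Toy IVF index: k-means cluster the vectors, then keep an
    inverted list of vector ids for each centroid."""
    random.seed(0)
    centroids = random.sample(vectors, n_clusters)
    lists = [[] for _ in range(n_clusters)]
    for _ in range(iters):
        lists = [[] for _ in range(n_clusters)]
        for i, v in enumerate(vectors):
            nearest = min(range(n_clusters), key=lambda c: l2(v, centroids[c]))
            lists[nearest].append(i)
        for c, members in enumerate(lists):
            if members:  # recompute centroid as the mean of its members
                dim = len(vectors[0])
                centroids[c] = [sum(vectors[i][d] for i in members) / len(members)
                                for d in range(dim)]
    return centroids, lists

def ivf_search(query, vectors, centroids, lists, nprobe=1, k=3):
    """Score only the vectors in the nprobe nearest clusters.
    Larger nprobe means better recall but slower search."""
    probed = sorted(range(len(centroids)),
                    key=lambda c: l2(query, centroids[c]))[:nprobe]
    candidates = [i for c in probed for i in lists[c]]
    return sorted(candidates, key=lambda i: l2(query, vectors[i]))[:k]
```

With `nprobe` equal to the number of clusters this degenerates to exact search; production systems tune `nprobe` to balance recall against latency.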
Metadata Filtering
Production vector search almost always combines similarity search with metadata filtering: "find the most similar documents to this query, but only from the Finance domain, and only published after 2024-01-01." Vector databases handle this with pre-filtering (apply the filter before searching the ANN index) or post-filtering (apply it after), depending on the system and the selectivity of the filter; post-filtering can return fewer results than requested when the filter is highly selective.
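A pre-filtering pass can be sketched as follows. The `Item` record and its `domain` and `published` fields are hypothetical, and a real engine pushes the filter into the ANN index rather than scanning linearly:

```python
import math
from dataclasses import dataclass

@dataclass
class Item:
    id: str
    vector: list[float]
    domain: str      # hypothetical metadata field
    published: str   # ISO date, e.g. "2024-03-01"

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def pre_filtered_search(query, items, domain, after, k=2):
    """Pre-filtering: restrict the candidate set by metadata *before*
    similarity scoring, so every returned hit satisfies the filter."""
    candidates = [it for it in items
                  if it.domain == domain and it.published > after]
    return sorted(candidates,
                  key=lambda it: cosine(query, it.vector), reverse=True)[:k]

items = [
    Item("q1-report",  [0.9, 0.1], "Finance", "2024-03-01"),
    Item("old-report", [0.9, 0.1], "Finance", "2023-06-01"),
    Item("hr-policy",  [0.9, 0.1], "HR",      "2024-05-01"),
]
hits = pre_filtered_search([1.0, 0.0], items, domain="Finance", after="2024-01-01")
print([h.id for h in hits])  # only "q1-report" passes both filters
```

Note that the HR document is excluded even though its vector is just as similar: the filter, not the similarity score, decides eligibility.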
Major Platforms
The market has consolidated around a handful of platforms: Pinecone (managed, cloud-native), Weaviate (open-source, hybrid vector/BM25 search), Qdrant (open-source, Rust-based, strong payload filtering), Milvus/Zilliz (distributed, enterprise-grade), and pgvector (Postgres extension, lowest barrier to entry). Most major cloud data warehouses (Snowflake, BigQuery, Databricks) now have built-in vector search capabilities, reducing the need for a separate vector database in some architectures.
Use Cases
Vector databases power a growing set of enterprise applications:
- RAG knowledge bases — The most common enterprise use case. Chunk documents (policies, reports, wikis) into segments, embed them, and store in a vector database. At query time, retrieve the most relevant segments to provide the LLM with grounded context.
- Semantic search — Replace keyword search (which fails when users don't know exact terminology) with meaning-based search. "Show me datasets about customer behavior" finds datasets tagged "user engagement," "purchase history," and "session analytics" — not just datasets with "customer" in the name.
- Duplicate detection — Find near-duplicate records, documents, or data entries by similarity rather than exact match. Useful for data deduplication, compliance (finding near-duplicate contracts), and content moderation.
- Recommendation systems — Find items similar to what a user has interacted with. Embeddings capture latent features that keyword matching misses.
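The RAG indexing and retrieval loop described above can be sketched end to end. The `embed` function here is a deliberately crude bag-of-words stand-in over a tiny fixed vocabulary (a real system would call an embedding model), and the corpus, chunk size, and vocabulary are all invented for illustration:

```python
import math

# Toy vocabulary; a real embedding model needs no vocabulary at all.
VOCAB = ["expense", "reports", "filed", "travel",
         "data", "quality", "dashboard", "refreshes"]

def embed(text: str) -> list[float]:
    """Stand-in embedding: normalized word counts over a fixed vocabulary."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    vec = [float(words.count(w)) for w in VOCAB]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def chunk(document: str, size: int = 40) -> list[str]:
    """Naive fixed-size character chunking; production systems chunk
    by tokens, sentences, or document structure."""
    return [document[i:i + size] for i in range(0, len(document), size)]

# Index: chunk each document, embed each chunk, store (doc, chunk_no, text, vector).
corpus = {
    "policy.md": "Expense reports must be filed within 30 days of travel.",
    "wiki.md":   "The data quality dashboard refreshes every night at 2am.",
}
index = []
for doc_id, text in corpus.items():
    for i, c in enumerate(chunk(text)):
        index.append((doc_id, i, c, embed(c)))

# Retrieve: embed the query, return the closest chunks as grounded LLM context.
def retrieve(query: str, k: int = 2):
    qv = embed(query)
    score = lambda row: sum(x * y for x, y in zip(qv, row[3]))
    return sorted(index, key=score, reverse=True)[:k]

print(retrieve("When are expense reports due?")[0][0])  # "policy.md"
```

The retrieved chunk text, not the whole document, is what gets placed into the LLM prompt as grounded context.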
Vector DB vs Other Databases
Vector databases complement rather than replace existing data infrastructure:
- vs Relational databases — Relational databases excel at exact-match queries and transactional workloads. Vector databases excel at similarity search. Many applications need both: find rows matching a filter (relational), then rank them by semantic similarity (vector).
- vs Document stores — Document stores (Elasticsearch, MongoDB) support full-text search with keyword matching. Vector databases provide semantic similarity. Hybrid search — combining BM25 keyword scoring with vector similarity — often outperforms either alone.
- vs Graph databases — Knowledge graphs encode explicit typed relationships. Vector databases encode implicit similarity in continuous space. GraphRAG combines both: use the graph for structured retrieval, the vector index for content retrieval.
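Hybrid search needs a way to merge a keyword ranking with a vector ranking. One widely used method is reciprocal rank fusion (RRF); the sketch below fuses two hypothetical result lists, with k=60 as the conventional damping constant:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists into one: each document scores
    1 / (k + rank) per list it appears in, summed across lists."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits   = ["doc_a", "doc_b", "doc_c"]  # keyword (BM25) ranking
vector_hits = ["doc_b", "doc_d", "doc_a"]  # vector-similarity ranking
print(reciprocal_rank_fusion([bm25_hits, vector_hits]))
```

Because RRF works on ranks rather than raw scores, it sidesteps the problem that BM25 scores and cosine similarities live on incomparable scales; a document that ranks well in both lists (doc_b here) rises to the top.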
Governance Challenges
Vector databases introduce governance challenges that most data teams haven't faced before:
A vector index is a derived data asset. It was created from source documents at a point in time, using a specific embedding model. When source documents change, the index becomes stale. When the embedding model is upgraded, old vectors are no longer comparable to new ones. Governing a vector index requires treating it like any other data pipeline: tracking its lineage, monitoring its freshness, and managing model version changes.
- Freshness — Vector indexes go stale as source documents are updated, added, or deprecated. A governance-aware deployment tracks when each chunk was last indexed and triggers re-embedding when source content changes.
- Access control — Documents with restricted access shouldn't be retrievable by unauthorized users through the vector database. Row-level security must be enforced at retrieval time, typically through metadata filters on the vector query.
- Embedding model versioning — When the embedding model changes, old and new vectors are incompatible. A model upgrade requires re-embedding the entire corpus — a non-trivial operational event that needs planning and rollback capability.
- Lineage — Tracking which source documents contributed to which vector index, and which version of each document, is the vector database equivalent of data lineage. Without it, debugging retrieval failures is guesswork.
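One way to operationalize freshness and model versioning together is to fingerprint each chunk with both its content and the embedding model version. This is a sketch of the idea, with invented chunk ids and version strings:

```python
import hashlib

def fingerprint(chunk_text: str, model_version: str) -> str:
    """Hash the chunk content together with the embedding model version:
    if either changes, the stored vector is stale."""
    return hashlib.sha256(f"{model_version}:{chunk_text}".encode()).hexdigest()

def stale_chunks(source_chunks: dict[str, str],
                 indexed: dict[str, str],
                 model_version: str) -> list[str]:
    """Return ids of chunks whose stored fingerprint no longer matches
    the current source content and model version."""
    return [cid for cid, text in source_chunks.items()
            if indexed.get(cid) != fingerprint(text, model_version)]

source = {"policy#0": "Expense reports are due in 30 days.",
          "policy#1": "Approvals go through the finance team."}
indexed = {cid: fingerprint(text, "embed-v1") for cid, text in source.items()}

source["policy#0"] = "Expense reports are due in 45 days."  # source edited
print(stale_chunks(source, indexed, "embed-v1"))  # ["policy#0"]
print(stale_chunks(source, indexed, "embed-v2"))  # model bump: everything stale
```

A model upgrade flips every fingerprint at once, which surfaces the full re-embedding job explicitly instead of letting old and new vectors silently mix in one index.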
In Enterprise AI
For enterprise AI deployments, the vector database is one component of a broader context engineering stack. It provides semantic retrieval — finding relevant documents and passages — but enterprise AI also needs structured facts, governed definitions, and explicit relationships that vector search alone can't provide.
The emerging pattern pairs vector search with a governed data layer: the vector database finds semantically relevant documents, and a knowledge graph or data catalog provides structured, authoritative facts about the entities those documents reference. Together, they give the LLM both rich context (from the vector database) and trustworthy ground truth (from the governed metadata layer).
Conclusion
Vector databases have moved from research infrastructure to standard enterprise data stack component. They enable semantic search and retrieval at scale — capabilities that underpin modern AI applications. But like any data asset, they require governance: tracking provenance, managing freshness, enforcing access control, and monitoring quality. Organizations that treat vector indexes as governed data assets — rather than opaque AI infrastructure — will build AI systems that remain reliable and trustworthy as their data evolves.