
What Is a Vector Database?

A vector database is a specialized data store designed to index, store, and query high-dimensional numeric vectors — commonly called embeddings — and perform similarity searches over them at scale. Where a relational database answers "find rows where column X equals Y," a vector database answers "find items most similar to this query," measured by distance in a high-dimensional space.

Vector databases have become a critical component of enterprise AI infrastructure because they are the engine behind retrieval-augmented generation (RAG), semantic search, recommendation systems, and any AI application that needs to find "things that mean the same" rather than "things that match exactly." Their adoption has grown sharply since 2023 as LLM-based applications moved from prototype to production.

TL;DR

A vector database stores numeric embeddings of text, images, or structured data and retrieves them by semantic similarity rather than exact match. It's the retrieval engine in most RAG architectures. The business challenge: vector indexes are opaque — governance requires tracking what was indexed, when, by whom, and from what source to ensure AI retrieval stays accurate and auditable.

Vectors and Embeddings

An embedding is a numeric representation of a piece of information — a sentence, a document, an image, a data record — produced by a machine learning model. The model maps the input into a high-dimensional space (typically 768–4096 dimensions) such that items with similar meaning are placed near each other in that space.

For example, a text embedding model converts "What is data quality?" and "How is data quality measured?" into vectors that are geometrically close — even though they share few literal words. A document about the GDPR and a document about EU data privacy law will have similar embeddings. A document about basketball will be far away from both.

This geometric structure is what enables semantic search: find the vectors (and their corresponding documents) that are closest to the query vector. The measure of closeness is typically cosine similarity (the angle between vectors) or Euclidean distance.
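To make "closeness" concrete, here is a minimal sketch of cosine similarity in pure Python. The three-dimensional vectors and their values are made up for illustration — real embeddings have hundreds to thousands of dimensions — but the geometry is the same: semantically related texts yield a similarity near 1, unrelated texts a value near 0.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for embedding-model output (values are invented):
quality_q = [0.9, 0.2, 0.1]    # "What is data quality?"
quality_m = [0.8, 0.3, 0.1]    # "How is data quality measured?"
basketball = [0.1, 0.1, 0.9]   # a document about basketball

sim_close = cosine_similarity(quality_q, quality_m)  # high: related meaning
sim_far = cosine_similarity(quality_q, basketball)   # low: unrelated topic
assert sim_close > sim_far
```

A semantic search engine is essentially this comparison, run between one query vector and millions of stored vectors — which is exactly the scaling problem the indexing algorithms below address.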

How Vector Databases Work

Naively searching for the nearest vector among millions of documents requires comparing the query to every stored vector — too slow for production. Vector databases solve this with approximate nearest neighbor (ANN) indexing algorithms that trade a small accuracy loss for orders-of-magnitude search speed improvement.

Key Indexing Algorithms

  • HNSW (Hierarchical Navigable Small World) — Graph-based index that supports high-accuracy search with low latency. Default choice for most use cases. Memory-intensive but fast.
  • IVF (Inverted File Index) — Clusters vectors and searches only the relevant clusters. Better for very large datasets where HNSW memory requirements become prohibitive.
  • PQ (Product Quantization) — Compresses vectors to reduce memory usage at some accuracy cost. Used in combination with IVF for billion-scale indexes.
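The core idea behind IVF-style indexing can be sketched in a few lines of pure Python. This is an illustrative toy, not a real index: the centroids are hand-picked (standing in for a k-means clustering step), and real systems use many clusters, compressed vectors, and tuned `nprobe` values. The point is that the query is compared against centroids first, and only the nearest bucket's vectors are scanned.

```python
import math

def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hand-picked centroids standing in for a k-means step (assumption).
centroids = [[0.0, 0.0], [10.0, 10.0]]

vectors = {
    "doc_a": [0.5, 0.2], "doc_b": [0.1, 0.9],    # near centroid 0
    "doc_c": [9.8, 10.1], "doc_d": [10.3, 9.7],  # near centroid 1
}

# Index time: assign each vector to the bucket of its nearest centroid.
buckets = {i: [] for i in range(len(centroids))}
for doc_id, vec in vectors.items():
    nearest = min(range(len(centroids)), key=lambda i: dist(vec, centroids[i]))
    buckets[nearest].append((doc_id, vec))

def ivf_search(query, nprobe=1):
    """Scan only the nprobe buckets whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)),
                   key=lambda i: dist(query, centroids[i]))[:nprobe]
    candidates = [pair for i in probe for pair in buckets[i]]
    return min(candidates, key=lambda pair: dist(query, pair[1]))[0]

result = ivf_search([9.9, 9.9])  # scans 2 of 4 vectors, not the whole corpus
```

The accuracy/speed trade-off lives in `nprobe`: probing more buckets recovers results that fell near a cluster boundary, at the cost of scanning more vectors.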

Metadata Filtering

Production vector search almost always combines similarity search with metadata filtering: "find the most similar documents to this query, but only from the Finance domain, and only published after 2024-01-01." Vector databases handle pre-filtering (filter before searching the ANN index) or post-filtering (filter after) depending on the system and selectivity of the filter.
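A pre-filtering pass can be sketched as follows — a hypothetical in-memory store where each record carries a vector plus the metadata used to constrain the search. Field names and values are invented for illustration; a real vector database applies the same logic inside the index rather than over a Python list.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

records = [
    {"id": "r1", "vec": [0.90, 0.10], "domain": "Finance", "published": "2024-03-01"},
    {"id": "r2", "vec": [0.80, 0.20], "domain": "Finance", "published": "2023-06-15"},
    {"id": "r3", "vec": [0.95, 0.05], "domain": "HR",      "published": "2024-05-20"},
]

def search(query_vec, k, domain=None, published_after=None):
    """Pre-filter: drop non-matching records, then rank survivors by similarity."""
    pool = [r for r in records
            if (domain is None or r["domain"] == domain)
            and (published_after is None or r["published"] > published_after)]
    pool.sort(key=lambda r: cosine(query_vec, r["vec"]), reverse=True)
    return [r["id"] for r in pool[:k]]

hits = search([1.0, 0.0], k=2, domain="Finance", published_after="2024-01-01")
# Only r1 qualifies: r3 is closer in vector space but fails the domain filter.
```

Note the interaction with selectivity: a highly selective filter makes pre-filtering cheap (few candidates survive), while a permissive filter often makes post-filtering over the ANN results faster.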

Major Platforms

The market has consolidated around a handful of platforms: Pinecone (managed, cloud-native), Weaviate (open-source, hybrid vector/BM25 search), Qdrant (open-source, Rust-based, strong payload filtering), Milvus/Zilliz (distributed, enterprise-grade), and pgvector (Postgres extension, lowest barrier to entry). Most major cloud data warehouses (Snowflake, BigQuery, Databricks) now have built-in vector search capabilities, reducing the need for a separate vector database in some architectures.

[Figure: Vector Database — Embedding and Retrieval Architecture. Ingestion: source documents pass through an embedding model into the vector space (shown as a 2D projection with data governance, AI/ML, and finance document clusters). Retrieval: a user query is embedded with the same model, an ANN index search (HNSW) finds the k nearest neighbors by cosine similarity, metadata filters (domain, date, access) are applied, and the top-k scored results are returned — ANN search finds semantically similar items without scanning every vector, at millisecond latency at scale.]

Use Cases

Vector databases power a growing set of enterprise applications:

  • RAG knowledge bases — The most common enterprise use case. Chunk documents (policies, reports, wikis) into segments, embed them, and store in a vector database. At query time, retrieve the most relevant segments to provide the LLM with grounded context.
  • Semantic search — Replace keyword search (which fails when users don't know exact terminology) with meaning-based search. "Show me datasets about customer behavior" finds datasets tagged "user engagement," "purchase history," and "session analytics" — not just datasets with "customer" in the name.
  • Duplicate detection — Find near-duplicate records, documents, or data entries by similarity rather than exact match. Useful for data deduplication, compliance (finding near-duplicate contracts), and content moderation.
  • Recommendation systems — Find items similar to what a user has interacted with. Embeddings capture latent features that keyword matching misses.
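The duplicate-detection case reduces to a similarity threshold over record pairs. Below is a hypothetical sketch: the embedding values and the 0.98 threshold are invented for illustration, and production systems would use an ANN index rather than comparing all pairs, but the principle — flag pairs whose embeddings nearly coincide even when the text is not byte-identical — is the same.

```python
import math
from itertools import combinations

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

embeddings = {  # made-up vectors standing in for an embedding model's output
    "contract_v1":      [0.70, 0.71, 0.05],
    "contract_v1_copy": [0.69, 0.72, 0.06],  # near-duplicate of contract_v1
    "invoice_2024":     [0.10, 0.05, 0.99],
}

THRESHOLD = 0.98  # illustrative cutoff; tune per corpus and model
duplicates = [
    (a, b) for a, b in combinations(embeddings, 2)
    if cosine(embeddings[a], embeddings[b]) >= THRESHOLD
]
```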

Vector DB vs Other Databases

Vector databases complement rather than replace existing data infrastructure:

  • vs Relational databases — Relational databases excel at exact-match queries and transactional workloads. Vector databases excel at similarity search. Many applications need both: find rows matching a filter (relational), then rank them by semantic similarity (vector).
  • vs Document stores — Document stores (Elasticsearch, MongoDB) support full-text search with keyword matching. Vector databases provide semantic similarity. Hybrid search — combining BM25 keyword scoring with vector similarity — often outperforms either alone.
  • vs Graph databases — Knowledge graphs encode explicit typed relationships. Vector databases encode implicit similarity in continuous space. GraphRAG combines both: use the graph for structured retrieval, the vector index for content retrieval.
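Hybrid search can be sketched as a weighted blend of the two scores. In this toy version a simple keyword-overlap fraction stands in for BM25 and cosine similarity stands in for the vector score — real systems use proper BM25 and often fuse ranked lists (e.g. reciprocal rank fusion) rather than raw scores; the documents and vectors here are invented.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def keyword_score(query_terms, doc_terms):
    """Fraction of query terms present in the document (toy stand-in for BM25)."""
    return len(set(query_terms) & set(doc_terms)) / len(set(query_terms))

def hybrid_score(query_terms, doc_terms, query_vec, doc_vec, alpha=0.5):
    """Weighted blend: alpha * keyword score + (1 - alpha) * vector similarity."""
    return (alpha * keyword_score(query_terms, doc_terms)
            + (1 - alpha) * cosine(query_vec, doc_vec))

docs = {
    "exact_wording": (["data", "quality", "rules"], [0.60, 0.40]),
    "same_meaning":  (["validation", "checks"],     [0.95, 0.05]),
}
query_terms, query_vec = ["data", "quality"], [1.0, 0.0]

ranked = sorted(
    docs,
    key=lambda d: hybrid_score(query_terms, docs[d][0], query_vec, docs[d][1]),
    reverse=True,
)
```

The `alpha` weight is the tuning knob: higher values favor exact terminology matches, lower values favor semantic closeness — which is why hybrid search tends to beat either signal alone.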

Governance Challenges

Vector databases introduce governance challenges that most data teams haven't faced before:

A vector index is a derived data asset. It was created from source documents at a point in time, using a specific embedding model. When source documents change, the index becomes stale. When the embedding model is upgraded, old vectors are no longer comparable to new ones. Governing a vector index requires treating it like any other data pipeline: tracking its lineage, monitoring its freshness, and managing model version changes.

  • Freshness — Vector indexes go stale as source documents are updated, added, or deprecated. A governance-aware deployment tracks when each chunk was last indexed and triggers re-embedding when source content changes.
  • Access control — Documents with restricted access shouldn't be retrievable by unauthorized users through the vector database. Row-level security must be enforced at retrieval time, typically through metadata filters on the vector query.
  • Embedding model versioning — When the embedding model changes, old and new vectors are incompatible. A model upgrade requires re-embedding the entire corpus — a non-trivial operational event that needs planning and rollback capability.
  • Lineage — Tracking which source documents contributed to which vector index, and which version of each document, is the vector database equivalent of data lineage. Without it, debugging retrieval failures is guesswork.
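A minimal way to operationalize the freshness and model-versioning points above is to store, alongside each indexed chunk, a content hash, the embedding model version, and the index timestamp. The sketch below is hypothetical — the chunk IDs, field names, and model version strings are assumptions — but the staleness check it implements is the core of a governance-aware re-embedding pipeline.

```python
import hashlib
from datetime import datetime, timezone

CURRENT_MODEL = "embed-model-v2"  # hypothetical current embedding model version

def content_hash(text):
    """Stable fingerprint of a chunk's source text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Per-chunk provenance stored next to the vectors (illustrative schema):
index_metadata = {
    "policy.md#chunk-0": {
        "hash": content_hash("Old policy text."),
        "model": "embed-model-v1",  # indexed before the model upgrade
        "indexed_at": datetime(2024, 1, 5, tzinfo=timezone.utc),
    },
}

def needs_reembedding(chunk_id, current_text):
    """A chunk is stale if its source changed or its model version is outdated."""
    meta = index_metadata.get(chunk_id)
    if meta is None:
        return True  # never indexed
    return (meta["hash"] != content_hash(current_text)
            or meta["model"] != CURRENT_MODEL)

stale = needs_reembedding("policy.md#chunk-0", "Old policy text.")
# True here: the content is unchanged, but the model version no longer matches.
```

The same metadata doubles as lineage: given a retrieval failure, it answers which document version, embedded by which model, produced the offending vector.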

In Enterprise AI

For enterprise AI deployments, the vector database is one component of a broader context engineering stack. It provides semantic retrieval — finding relevant documents and passages — but enterprise AI also needs structured facts, governed definitions, and explicit relationships that vector search alone can't provide.

The emerging pattern pairs vector search with a governed data layer: the vector database finds semantically relevant documents, and a knowledge graph or data catalog provides structured, authoritative facts about the entities those documents reference. Together, they give the LLM both rich context (from the vector database) and trustworthy ground truth (from the governed metadata layer).

Conclusion

Vector databases have moved from research infrastructure to standard enterprise data stack component. They enable semantic search and retrieval at scale — capabilities that underpin modern AI applications. But like any data asset, they require governance: tracking provenance, managing freshness, enforcing access control, and monitoring quality. Organizations that treat vector indexes as governed data assets — rather than opaque AI infrastructure — will build AI systems that remain reliable and trustworthy as their data evolves.

© Dawiso s.r.o. All rights reserved