pgvector Performance on NVMe vs Cloud SSDs: Benchmarks and Results
Everyone benchmarking pgvector focuses on CPU, memory, and index parameters. Almost nobody talks about storage, and that's a mistake.
HNSW indexes don't do sequential reads. They jump between nodes in a graph, which means random reads. The difference between NVMe and a cloud block device like AWS gp3 isn't 2x. It's an order of magnitude in IOPS, and it shows up directly in your query latency.
We ran the same benchmark on both storage types with everything else held constant. Here's what we found.
Why Storage Matters for pgvector
When you run a vector similarity search with an HNSW index, PostgreSQL traverses a multi-layer graph. At each step, it reads a node from disk (or from the buffer cache if it's already there). For small datasets that fit in memory, storage doesn't matter because everything is cached. But once your index exceeds shared_buffers, every cache miss becomes a random read.
Here's why that matters:
- gp3 volumes deliver 3,000 baseline IOPS (up to 16,000 provisioned). Each random 8KB read has ~0.5ms latency.
- NVMe SSDs deliver 500,000+ IOPS with ~0.02ms latency per random read.
A single HNSW search at ef_search=40 visits 40 to 200 nodes. If even 10 of those are cache misses, you're looking at 5ms on gp3 vs 0.2ms on NVMe. Multiply that by concurrent queries and the difference compounds.
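That back-of-envelope math is simple enough to sketch directly. The miss count and per-read latencies below are the figures quoted above, not new measurements:

```python
# Rough model: storage time added to one HNSW query by cache misses.
# Per-read latencies are the article's figures for gp3 and local NVMe.
GP3_READ_MS = 0.5    # ~0.5 ms per random 8KB read on gp3
NVME_READ_MS = 0.02  # ~0.02 ms per random read on NVMe

def miss_latency_ms(cache_misses: int, per_read_ms: float) -> float:
    """Storage latency one query pays for its uncached graph hops."""
    return cache_misses * per_read_ms

misses = 10  # assumed misses out of the 40-200 nodes a search visits
print(miss_latency_ms(misses, GP3_READ_MS))   # 5.0 ms on gp3
print(miss_latency_ms(misses, NVME_READ_MS))  # ~0.2 ms on NVMe
```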
Test Setup
We kept everything identical except the storage layer:
| Parameter | Value |
|---|---|
| PostgreSQL | 18.2 |
| pgvector | 0.8.0 |
| CPU | 4 vCPUs (AMD EPYC) |
| RAM | 16 GB |
| shared_buffers | 4 GB |
| Embedding dimensions | 1536 (OpenAI text-embedding-3-small) |
| HNSW m | 16 |
| HNSW ef_construction | 200 |
| ef_search | 40 |
| Distance metric | Cosine |
| Benchmark tool | pgbench custom script, 16 concurrent clients |
| Dataset | Random normalized vectors (uniform distribution) |
Storage A: AWS gp3, 500 GB, 3,000 IOPS baseline, 125 MB/s throughput
Storage B: Local NVMe SSD (Intel P5800X), 480 GB, 500K+ random read IOPS
Both systems were warmed up with identical workloads before measurement. Each test ran for 5 minutes after warmup.
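For reference, a schema and index matching these parameters would look something like this (the table and column names are illustrative, not taken from the benchmark harness):

```sql
-- Illustrative schema matching the parameters above.
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE items (
    id bigserial PRIMARY KEY,
    embedding vector(1536)  -- text-embedding-3-small dimensions
);

-- HNSW index: m=16, ef_construction=200, cosine distance
CREATE INDEX items_embedding_idx ON items
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 200);

-- Per-session search knob used during the benchmark
SET hnsw.ef_search = 40;
```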
Results
10,000 Vectors (Index fits in memory)
| Metric | NVMe | gp3 | Difference |
|---|---|---|---|
| QPS | 4,820 | 4,650 | +3.7% |
| p50 latency | 0.3 ms | 0.3 ms | same |
| p95 latency | 0.8 ms | 0.9 ms | same |
| Recall@10 | 0.99 | 0.99 | same |
No meaningful difference. At 10K vectors, the entire HNSW index (~95 MB) fits comfortably in shared_buffers. Storage is never touched during queries.
100,000 Vectors (Index exceeds buffer cache)
| Metric | NVMe | gp3 | Difference |
|---|---|---|---|
| QPS | 3,940 | 1,820 | +116% |
| p50 latency | 0.4 ms | 1.2 ms | 3x faster |
| p95 latency | 1.1 ms | 4.8 ms | 4.4x faster |
| Recall@10 | 0.99 | 0.99 | same |
This is where it starts. The HNSW index at 100K vectors is ~950 MB. With shared_buffers at 4 GB and other tables competing for cache space, cache misses become common. gp3 latency spikes at the tail.
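You can watch this transition in PostgreSQL's I/O statistics. A sketch, assuming the index is named `items_embedding_idx` (substitute your own); note this only covers PostgreSQL's buffer cache, not the OS page cache:

```sql
-- Buffer cache hit ratio for one index; a falling ratio means
-- graph traversal is increasingly falling through to storage.
SELECT idx_blks_hit,
       idx_blks_read,
       round(idx_blks_hit::numeric
             / nullif(idx_blks_hit + idx_blks_read, 0), 3) AS hit_ratio
FROM pg_statio_user_indexes
WHERE indexrelname = 'items_embedding_idx';
```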
1,000,000 Vectors (Index far exceeds memory)
| Metric | NVMe | gp3 | Difference |
|---|---|---|---|
| QPS | 2,150 | 410 | +424% |
| p50 latency | 0.9 ms | 5.2 ms | 5.8x faster |
| p95 latency | 2.8 ms | 18.6 ms | 6.6x faster |
| Recall@10 | 0.99 | 0.99 | same |
5x more throughput, 6x lower tail latency. At 1M vectors the HNSW index is ~9.5 GB, more than double shared_buffers. Nearly every graph traversal step hits disk. NVMe handles this gracefully. gp3 falls off a cliff.
Note that recall is identical in all cases. Both storage types return the same results. The only difference is how fast they get there.
Why NVMe Wins
The gap comes down to one number: random read IOPS.
gp3 baseline: 3,000 IOPS = 0.5 ms per random read
gp3 provisioned: 16,000 IOPS = 0.3 ms per random read
NVMe: 500,000+ IOPS = 0.02 ms per random read
HNSW search is essentially a series of random pointer chases through a graph. Each visited node requires reading a page from storage (unless cached). The more vectors you have, the larger the index, the more cache misses, and the more you depend on raw IOPS.
Sequential throughput, the number cloud providers love to advertise, barely matters. HNSW never reads pages sequentially during queries. It's random access all the way down.
The gp3 Burst Problem
gp3 volumes have a burst bucket that lets them temporarily exceed 3,000 IOPS. This makes synthetic benchmarks look better than reality. In production, with sustained query traffic, the burst bucket drains and you're back to baseline. Our benchmarks ran long enough (5 minutes post-warmup) to capture steady-state performance, not burst.
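To see what your own volume sustains past the burst window, a long fio run is a reasonable proxy. The parameters mirror this workload (8KB random reads, held for minutes); the directory path is an assumption, so point it at the volume under test:

```shell
# Sustained 8KB random-read test; run long enough to drain any burst bucket.
# Creates a scratch file under --directory (assumed path shown).
fio --name=pgvector-randread \
    --directory=/var/lib/postgresql \
    --size=10G \
    --rw=randread --bs=8k --direct=1 \
    --ioengine=libaio --iodepth=32 \
    --time_based --runtime=600 \
    --group_reporting
```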
Provisioned IOPS Doesn't Close the Gap
You can provision gp3 up to 16,000 IOPS, but that's still 30x fewer than NVMe. At 1M vectors, provisioned IOPS improved QPS from 410 to ~680. Better, but still 3x slower than NVMe. And provisioned IOPS isn't free: gp3 charges for IOPS beyond the 3,000 baseline, and io2 volumes (the next step up) run $0.065/IOPS/month, where 16,000 IOPS adds $1,040/month to your storage bill.
What This Means for Your Project
NVMe matters when:
- Your vector dataset exceeds available memory (the index can't be fully cached)
- You need consistent low-latency queries under concurrent load
- Your application is latency-sensitive (real-time search, recommendations, RAG pipelines)
- You're running at scale (100K+ vectors with production traffic)
NVMe doesn't matter when:
- Your dataset is small enough to fit in memory (under ~50K vectors with typical shared_buffers)
- You're running batch jobs where latency isn't critical
- Your query volume is low (a few QPS won't saturate even gp3)
The threshold depends on your shared_buffers setting, vector dimensions, and HNSW parameters. A rough rule: if your HNSW index size exceeds 50% of shared_buffers, you'll start seeing storage become the bottleneck.
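That rule of thumb can be sketched numerically. The per-vector footprint below (~9.5 KB for 1536-dim vectors at m=16) is an assumption derived from the index sizes measured above, not a general constant; your build may differ:

```python
# Rough HNSW sizing check against the 50%-of-shared_buffers rule.
# 9.5 KB/vector is implied by this article's measurements (1536 dims, m=16).
BYTES_PER_VECTOR = 9_500  # assumption from the measured index sizes

def index_size_gb(n_vectors: int) -> float:
    return n_vectors * BYTES_PER_VECTOR / 1e9

def storage_bound(n_vectors: int, shared_buffers_gb: float) -> bool:
    """Rough rule: storage becomes the bottleneck past 50% of shared_buffers."""
    return index_size_gb(n_vectors) > 0.5 * shared_buffers_gb

print(index_size_gb(1_000_000))        # 9.5 GB at 1M vectors
print(storage_bound(1_000_000, 4.0))   # True with 4 GB shared_buffers
```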
You can check your index size with:
SELECT pg_size_pretty(pg_relation_size('your_index_name'));
How to Reproduce These Benchmarks
We used a simple pgbench custom script:
\set id random(1, :scale)
SELECT id, embedding <=> (
SELECT embedding FROM items WHERE id = :id
) AS distance
FROM items
ORDER BY embedding <=> (
SELECT embedding FROM items WHERE id = :id
)
LIMIT 10;
Run with:
pgbench -c 16 -j 4 -T 300 -f bench.sql -D scale=1000000 your_database
The dataset was generated with random normalized 1536-dimensional vectors. Real-world embeddings with clustered distributions may show slightly different cache behavior, but the storage-bound pattern is the same.
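A comparable dataset can be generated with the standard library alone and bulk-loaded with COPY. Normalizing Gaussian components gives directions uniform on the unit sphere; the `items` table and CSV path are assumptions matching the schema used in the query script:

```python
# Generate random normalized 1536-dim vectors as CSV for COPY into pgvector.
# Stdlib only; the vector column uses pgvector's text form: [x,y,...]
import csv
import math
import random

DIMS = 1536
N = 100  # bump to your target scale

def random_unit_vector(dims: int) -> list[float]:
    # Gaussian components + normalization -> uniform direction on the sphere.
    v = [random.gauss(0.0, 1.0) for _ in range(dims)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

with open("vectors.csv", "w", newline="") as f:
    w = csv.writer(f)
    for i in range(1, N + 1):
        vec = random_unit_vector(DIMS)
        w.writerow([i, "[" + ",".join(f"{x:.6f}" for x in vec) + "]"])

# Then load it with (path is an assumption):
# COPY items (id, embedding) FROM '/path/vectors.csv' WITH (FORMAT csv);
```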
The Bottom Line
pgvector performance at scale is a storage problem, not a CPU problem. NVMe delivers 5x the throughput and 6x lower tail latency compared to cloud block storage for datasets that don't fit in memory. No amount of HNSW tuning can compensate for 0.5ms random reads when you need 0.02ms.
We built Rivestack on NVMe for exactly this reason. Every instance runs on local NVMe storage with no network-attached volumes, no IOPS limits, and no burst buckets. If you want to test the difference with your own data, start with the free tier and run your own benchmarks.