I think the whole field of vector databases is mostly just one huge misunderstanding. Most of you are not Google or any other big tech company, so you won't have billions of embeddings.
It's crazy how people add bloat and complexity to their stuff just because they want to do medium scale RAG with ca. 2 million embeddings.
Here comes the punchline: you do not need a fancy vector database in this case. I stumbled over https://github.com/sqliteai/sqlite-vector, a SQLite extension, and I wonder why no one did this before: it simply implements a highly optimized brute-force search over the vectors, so you get sub-100 ms queries over millions of vectors with perfect recall. It uses dynamic runtime dispatch to take advantage of whatever SIMD instructions your CPU has. Turns out this might be all you need. No need for a memory-hungry search index (like HNSW) or writing a huge index to disk (like DiskANN).
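To make the scale concrete, here is a minimal brute-force sketch in plain NumPy (my own illustration, not the extension's actual API): exact top-k over ~2 million normalized embeddings is just one matrix-vector product plus a partial sort, and BLAS/SIMD keeps that fast.

```python
# Minimal sketch of exact (brute-force) vector search in NumPy.
# Illustrative only -- not the sqlite-vector API. Assumes ~2M 768-dim
# float32 embeddings (~6 GB in RAM), normalized so dot product == cosine.
import numpy as np

rng = np.random.default_rng(0)
n, d = 2_000_000, 768
vectors = rng.standard_normal((n, d), dtype=np.float32)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

def search(query: np.ndarray, k: int = 10) -> np.ndarray:
    """Return indices of the k most similar vectors (perfect recall)."""
    scores = vectors @ query                  # one SIMD-friendly matvec over all rows
    top_k = np.argpartition(scores, -k)[-k:]  # O(n) partial selection of the k best
    return top_k[np.argsort(scores[top_k])[::-1]]

query = vectors[12345]       # any normalized query vector
print(search(query, k=10))   # exact top-10 over 2M vectors
```

At this scale the scan is essentially memory-bandwidth-bound, which is presumably where the extension's SIMD dispatch earns its speed.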
Might be all you need, except an open source licence:
> For production or managed service use, please contact SQLite Cloud, Inc for a commercial license.
Damn, you're right. That's a deal breaker for me at least.
I would like to see a “DataFusion for Vector databases,” i.e. an embeddable library that Does One Thing Well – fast embedding generation, index builds, retrieval, etc. – so that different systems can glue it into their engines without reinventing the core vector capabilities every time. Call it a generic “vector engine” (or maybe “embedding engine” to avoid confusion with “vectorized query engine.”)
Currently, every new solution is either baked into an existing database (Elastic, pgvector, Mongo, etc) or an entirely separate system (Milvus, now Vectroid, etc.)
There is a clear argument in favor of the pgvector approach, since it simply brings new capabilities to 30 years of battle-tested database tech. That’s more compelling than something like Milvus that has to re-invent “the rest of the database.” And Milvus is also a second system that needs to be kept in sync with the source database.
But pgvector is still _just for Postgres_. It’s nice that it’s an extension, but in the same way Milvus has to reinvent the database, pgvector needs to reinvent the vector engine. I can’t load pgvector into DuckDB as an extension.
Is there any effort to make a pure, Unix-style, batteries not included, “vector engine?” A library with best-in-class index building, retrieval, storage… that can be glued into a Postgres extension just as easily as it can be glued into a DuckDB extension?
I think we have plenty of those nice open source libraries, but the problem is not the library or the algorithm (HNSW or IVF derivatives). The problem is figuring out the right distributed architecture to balance cost, accuracy (recall), and speed (latency). I believe no single library will give you all of that. For instance, if you don't separate writes (indexing) from reads (queries) and scale them separately, then either your indexing will suck or your indexing will destroy your read latency. You won't be able to scale as easily either. I believe that is why AWS created Aurora and Google Cloud created AlloyDB to scale relational databases (MySQL/PostgreSQL): by separating reads from writes, implementing a scalable storage backend, and offloading a lot of shared work (replication, compaction, indexing) to clusters of machines.
Yeah, I feel like these libraries are all one level lower than what I’m asking for. We need something that makes more assumptions (e.g. “I’m running as a component of some kind of database”) but… makes fewer decisions? Is more flexible? Idk. This is the hard part.
DataFusion nailed this balance between an embedded query engine and a standalone database system. It brings just the right amount of batteries that it’s not a super generic thing that does nothing useful out of the box, but it doesn’t bring so many that it needs to compete with full database systems.
I believe the maintainers refer to it as “the IR of databases” and I’ve always liked that analogy. That’s what I’d like to see for vector engines.
Maybe what we need as a pre-requisite is the equivalent of arrow/parquet ecosystem for vectors. DataFusion really leverages those standards for interoperability and performance. This also goes a long way toward the architectural decisions you reference — Arrow and Parquet are a solid, “good enough” choice for in-memory and storage formats that are efficient and flexible and well-supported. Is there something similar for vector storage?
I couldn't agree with this more. I don't think the majority of problems with vector search at scale are vector search problems (although filtering + ANN is definitely interesting), they're search-problems-at-scale problems.
USearch is this type of library: https://github.com/unum-cloud/usearch
Used in ClickHouse and a few other DBMS.
Soo… usearch? It's literally one header file (of what used to be strict C++11). Funnily enough, that is what is used in the official duckdb-vss extension.
Disclaimer: I wrote duckdb-vss
We’re building vector indexes into Datafusion for search (starting with S3 vectors).
Open source at https://github.com/spiceai/spiceai
why not use this? https://github.com/facebookresearch/faiss
M is minutes
I was starting to think this was impressive, if not impossible. 1B vectors in 48 MB of storage => < 1 bit per vector.
Maybe not impossible using shared/lossy storage if they were sparsely scattered over a large space ?
But anyways - minutes. Thanks.
Edit: Gemini suggested that this sort of (lossy) storage size could be achieved using "Product Quantization" (sub vectors, clustering, cluster indices), giving an example of 256 dimensional vectors being stored at an average of 6 bits per vector, with ANN being one application that might use this.
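For the curious, here is a rough product-quantization sketch (my own illustration, parameters arbitrary): split each vector into sub-vectors, k-means each sub-space, and store only the centroid indices. This particular setup lands at 64 bits per 256-dim vector; more aggressive settings can go much lower, at the cost of accuracy.

```python
# Rough product-quantization sketch (illustrative, not a tuned implementation).
# Each D-dim vector is split into M sub-vectors; each sub-vector is replaced by
# the index of its nearest centroid, so storage drops to M * log2(K) bits/vector.
import numpy as np
from sklearn.cluster import KMeans

D, M, K = 256, 8, 256            # 256 dims, 8 sub-vectors of 32 dims, 256 centroids each
rng = np.random.default_rng(0)
data = rng.standard_normal((10_000, D), dtype=np.float32)

codebooks, codes = [], []
for m in range(M):
    sub = data[:, m * (D // M):(m + 1) * (D // M)]
    km = KMeans(n_clusters=K, n_init=4, random_state=0).fit(sub)
    codebooks.append(km.cluster_centers_)
    codes.append(km.labels_.astype(np.uint8))   # 1 byte per sub-vector here

codes = np.stack(codes, axis=1)   # shape (N, M): 8 bytes per 256-dim vector

def approx_distances(query: np.ndarray) -> np.ndarray:
    """Asymmetric distance: raw query vs. quantized database codes."""
    tables = [np.linalg.norm(cb - query[m * (D // M):(m + 1) * (D // M)], axis=1) ** 2
              for m, cb in enumerate(codebooks)]
    return sum(tables[m][codes[:, m]] for m in range(M))
```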
Yeah, the SI symbol for minutes is min, if you're going to abbreviate it in a technical context. Super funky using M.
Agree the correct abbreviation is min.
Nitpick: could be wrong but I don’t think minutes is an SI derived unit.
Thank you, the title needs to be edited.
Legend
Thankfully not months.
Oh, the horrors of search indexing I've seen... including weeks / months to rebuild an index.
Not trying to be snarky, just curious -- How is this different from TurboPuffer and other serverless, object storage backed vector DBs?
Hey! It's a great question. Co-founder of Vectroid here.
Today, the differences are going to be performance, price, accuracy, flexibility, and some intangible UI elegance.
Performance: We actually INITIALLY built Vectroid for the use-case of billions of vectors and near single digit millisecond latency. During the process of building and talking to users, we found that there are just not that many use-cases (yet!) that are at that scale and require that latency. We still believe the market will get there, but it's not there today. So we re-focused on building a general purpose vector search platform, but we stayed close to our high performance roots, and we're seeing better query performance than the other serverless, object storage backed vector DBs. We think we can get way faster too.
Price: We optimized the heck out of this thing with object storage, pre-emptible virtual machines, etc. We've driven our cost down, and we're passing this to the user, starting with a free tier of 100GB. Actual pricing beyond that coming soon.
Accuracy: With our initial testing, we see recall greater or equal to competitors out there, all while being faster.
Flexibility: We are going to have a self managed version for users who want to run on their own infra, but admittedly, we don't have that today. Still working on it.
Other Product Elegance: My co-founder, Talip, made Hazelcast, and I've always been impressed by how easy it is to use and how the end to end experience is so elegant. As we continue to develop Vectroid, that same level of polish and focus on the UX will be there. As an example, one neat thing we rolled out is direct import of data from Hugging Face. We have lots of other cool ideas.
Apologies for the long winded answer. Feel free to ping us with any additional questions.
I’m curious, what’s the tech stack behind this?
Vectroid is pure Java solution based on modified version of Lucene. We use a custom built FileSystem to work directly with GCS (Google cloud object store). It is a terraform/helm managed Kubernetes deployment.
Interesting, perhaps, you can write a blog post about it. It would be interesting to read about what kind of changes you made to Lucene.
There was recently this paper: https://arxiv.org/abs/2508.21038
They show that with 4096-dimensional vectors, accuracy starts to fail at 250 mln documents (fundamental limits of embedding models). For 512-dim, it's just 500k.
Is 1 bln vectors practical?
Those numbers are for the case where you want all possible pairs of two vectors to have a corresponding query that returns those vectors as the top two results.
If you mostly just want to find a particular single vector if possible and don't care so much what the second-best result is, you can get away with much smaller embeddings.
And if you do want to cover all possible pairs, 6500 dimensions or so should be enough. (Their empirical results roughly fit a cubic polynomial.)
I would think that 1 bln refers to the row count, not to a vector's length.
Very curious about the hardware setup used for this benchmark!
No special hardware. Google Cloud vms. We use multiple of them during index building.
The question is how many, and what kind of VMs you use? It greatly affects performance :)
I run a lot of search-related benchmarks (https://github.com/ashvardanian) and curious if you’ve compared to other engines on the same hardware setup, tracing recall, NDCG, indexing, and query speeds.
We shard the data and index on about 6 x n2-standard-96 spot instances, so the total cost of indexing the entire deep1b is less than $12. We are working on making it $6. We separate indexing and query VMs; for queries we use dedicated VMs. USearch numbers look great and are better than ours if you run the query and indexing on the same VM/node. We believe a distributed, task-oriented design is the right way to handle vector search for thousands of tenants with datasets of different sizes. Data ingest is also a separate task for us, so ingest, index, and query are all handled by different clusters of VMs.
By the creator of the real-time data platform https://en.wikipedia.org/wiki/Hazelcast.
1B vectors is nothing. You don’t need to index them. You can hold them in VRAM on a single node and run queries with perfect accuracy in milliseconds
I guess for 2D vectors that would work?
For 1024 dimensions, even with 8-bit quantization you are looking at a terabyte of data. Let's make it binary vectors; it is still 128 GB of VRAM.
WAT?
1B x 4096 = 4T scalars.
That doesn't fit in anyone's video ram.
Well we have AI GPUs now so you could do it.
Each MI325x has 256 GB of HBM, so you would need ~32 of them if it were 2 bytes per scalar.
Show your math lol
I assume by "node" OP meant something like a DGX node. Which yea, that would work, but not everyone (no one?) wants to buy a 500k system to do vector search.
B200 spec:
* 8TB/sec HBM bandwidth
* 10 PetaOPs assuming int8.
* 186GB of VRAM.
If we work with 512-dimensional int8 embeddings, then we need 512 GB of VRAM to hold them, so assuming we have an 8xB200 node (~$500k+), we can easily hold them (125M vectors per GPU).
It takes about 1000 OPs to do the dot product between two vectors, so we need 1000 * 1B = 1 TeraOP total; spread over 8 GPUs, that's 125 GigaOPs per GPU, a fraction of a ms.
Now the bottleneck will be data movement between HBM -> chips, since we have 125M vectors per GPU, aka 64GB, we can move them in ~8 ms.
Here you go, the most expensive vector search in history, giving you the same performance as a regular CPU-based vectorDB for only 1000x the price.
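A quick script re-running that back-of-envelope arithmetic (same quoted specs as above, nothing measured):

```python
# Back-of-envelope for the 8x B200 scenario above (quoted specs, not measured).
n_vectors = 1_000_000_000
dims      = 512                   # int8, 1 byte per dim
gpus      = 8
hbm_bw    = 8e12                  # bytes/s per GPU (quoted 8 TB/s)
int8_ops  = 10e15                 # OPs/s per GPU (quoted 10 PetaOPS)

bytes_total   = n_vectors * dims              # 512 GB total
bytes_per_gpu = bytes_total / gpus            # 64 GB per GPU
ops_per_gpu   = n_vectors * 2 * dims / gpus   # ~1000 OPs per dot product

print(f"per-GPU data:  {bytes_per_gpu / 1e9:.0f} GB")
print(f"compute time:  {ops_per_gpu / int8_ops * 1e3:.3f} ms")  # ~0.013 ms
print(f"HBM scan time: {bytes_per_gpu / hbm_bw * 1e3:.1f} ms")  # ~8 ms, the real bottleneck
```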
Thanks for doing the math! I suppose if we are charitable, in practice we would of course index and only partially offload to VRAM (FAISS does that with IVF/PQ and similar).
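For reference, a minimal FAISS IVF+PQ setup of the kind mentioned here might look roughly like this (parameters illustrative, not tuned for a billion vectors):

```python
# Minimal FAISS IVF+PQ sketch (parameters illustrative, not tuned).
import faiss
import numpy as np

d, nlist, m, nbits = 512, 4096, 64, 8        # 64 sub-quantizers x 8 bits = 64 bytes/vector
quantizer = faiss.IndexFlatL2(d)             # coarse quantizer for the inverted lists
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)

train = np.random.rand(200_000, d).astype(np.float32)
index.train(train)                           # learn coarse centroids + PQ codebooks
index.add(train)                             # add compressed vectors

index.nprobe = 32                            # lists scanned per query: recall/speed knob
distances, ids = index.search(train[:5], k=10)
print(ids)
```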
How is this different from running tuned HNSW vector indices on Elasticsearch?
Lucene is tough to deal with. About 15 hours ago — right when this comment was posted — I was giving a talk at Databricks comparing the world’s most widely used search engines. I’ve never run into as many issues with any other similar tool as I did with Lucene. To be fair, it’s been around for ~26 years and has aged remarkably well... but it’s the last thing I’d choose today.
Interesting, then, that Vectroid would choose to fork it.
Elasticsearch is at least good at hiding the Lucene zoo under the hood.
Co-founder of Vectroid here: we forked Lucene. Lucene is awesome for search in general, filters, and obviously full-text search. Very mature and well supported by so many big names and amazing engineers. So we take advantage of that, but we had to change a few things to make it work well for the vector use-case. We basically think the vector should be the main data type, as it is the most difficult one to deal with. For instance, we modified Lucene to use X number of CPUs/threads to build a single segment index. As a result, if/when needed, we can utilize hundreds of CPUs to index quicker and generate fewer segments, which enables lower query latency. We also built a custom FileSystem Directory for Lucene to work off of GCS directly (or S3 later on). It can bypass the kernel, read from the network, and write directly into memory... no SSD, no page cache, no mmap involved. Perhaps I should not say more...
Aside from being serverless, this is like Elasticsearch but with a kind of built-in Redis-like layer, I think.
Proprietary closed-source lock-in. Nothing to see here.
Seriously. The amount of lift a SaaS product needs to give me is insane for me to even bother evaluating it, and there's a near zero percent chance I'll use it in my core.
Especially a product that demands access to large quantities of your most sensitive data to be useful.
I really feel like we're heading down the slope of a large section of the internet dying off, and if that happens I think it may fracture even more than it already has globally.
What do you think an alternative is for someone who:
1. Has a technical system they think could be worth a fortune to large enterprises, containing at least a few novel insights to the industry.
2. Knows that competitors and open source alternatives could copy/implement these in a year or so if the product starts off open source.
3. Has to put food on the table and doesn’t want to give massive corporations extremely valuable software for free.
Open source has its place, but IMO it is one of the ways to give monopolies massive value for free. There are plenty of open source alternatives around for vector DBs. Do we (developers) need to give everything away to the rich?
Traditionally the most profitable approach is offering enterprise support and consulting.
Enterprises are so very fond of choosing novel open source technologies, too!
(not)
I have been working for 4 years with "enterprise" software, and I feel like the whole field is some kind of collective insanity.
Let's say the best open source product has a feature score of 70/100, and the best closed source product has a feature score of 85/100, and this is me being generous with the latter. The issue is that just by being closed source, it immediately loses 20/100, bringing its score to 65/100, which is below the open offering. A closed source product carries substantial risk if the company behind it were to stop maintaining it, which is why the adjustment by -20 applies.
Secondly, as far as I know, the blocker with approximate neighbor search is often not insertion but search. And if this search were worth a fortune to me, I'd simply embarrassingly parallelize it on CPUs or GPUs.
Not that locked in - you can just move your vectors to another platform, no?
Vectroid co-founder here. We're huge fans of open source. My co-founder, Talip, made Hazelcast, which is open source.
It might make sense to open source all or part of Vectroid at some point in the future, but at the moment, we feel that would slow us down.
I hate vendor lock-in just as much as the next person. I believe data portability is the ACTUAL counter to vendor lock-in. If we have clean APIs to get your data in, get your data out, and the ability to bulk export your data (which we need to implement soon!), then there's less of a concern, in my opinion.
I also totally understand and respect that some people only want open source software. I'm certainly like that w/ my homelab setup! Except for Plex... Love Plex... Usually.
Nothing for you to see here. Surely you just aren't their target customer.
So who is? Who really needs to index 1 billion new vectors every 48 minutes, or perhaps equivalently 1 million new vectors every 3 seconds?
If HNSW were accurate enough (and if this DB were much faster) then I'd have a use case. I wound up going down a different route to create a differentiable database for ML shenanigans though.