Scaling up Vector Search

Luca_Dorigo · June 11, 2024, 5:55pm

Hello,

We are using Vector Search to power our semantic search applications and are trying to find some information on the best ways to structure our application/data for scaling up. Here’s a few questions we couldn’t answer from the docs:

https://mongodb.prakticum-team.ru/docs/atlas/atlas-vector-search/tune-vector-search/ says : “You must ensure that the data nodes have enough RAM to hold the vector data and indexes.” . So I assume that “the vector data” means n_documents*embeddings_dim; where can I see the index size? Is it the size indicated under the “Atlas search” tab (5Gb in the screenshot below), or does that also include the vector data?

image2244×65 14.4 KB
Assuming that the index holds in RAM: how does the performance scale with the amount of vectors? Is it ~O(1)? O(log n)? Something else?
The docs recommend pre-filtering the data to improve performance. How is the performance scale w.r.t. amount of vectors/amount matched by the filter? Say we have 1M vectors and our filter matches 20k of them. Will the vector search performance be the same as searching an index with only 20k vectors?
Is there a limit to how many “partitions” we can have on the vectors for filtering? Say we have a field filter_field which we use to split embeddings in N groups which we use to pre-filter the embeddings before doing the vector search. Can N be arbitrarily large without increasing the
Our embeddings are all in a single collection and are partitioned into N buckets with a field (say bucket_id). We typically search within a group of M buckets by using the pre-filter "$in": [bucket_id_1, ..., bucket_id_M]. How does M impact the performance of the search? Is this approach suitable for scenarios where M might grow very large (100? 1k? 10K?)?
In case the answer to the previous question is that M should stay relatively small, here is the problem we are trying to solve: our current scenario (simplifying a bit) is that we partition data per-customer, where a customer might have a few “buckets” of embeddings (actually, each bucket is a set of text documents that are then split and embedded). So when doing a search for customer X, we get the list of the buckets for that customer that are relevant to the search filter embeddings belonging to that bucket using “$in”. We’re scaling up now and are facing the issue that some customers have massive amounts of buckets where the majority of text document (and thus embeddings) are duplicated in many of the buckets, leading to massive amounts of duplicate embeddings.

One solution to this (c.f. previous question) would be that rather than filtering based on buckets; we filter based on the text documents within the bucket (each embedding would only point to the parent text document, and “buckets” would contain a list of parent text documents). This means that to compute the filter for a query, we’d get the list of text documents within a bucket (potentially very large) and filter embeddings that are $in to any of those text documents.

Are there alternative approaches?

Henry_Weller · June 13, 2024, 3:46pm

Hey @Luca_Dorigo! Thanks for submitting your question. Here are some answers that are hopefully a helpful guide:

Luca_Dorigo:

https://mongodb.prakticum-team.ru/docs/atlas/atlas-vector-search/tune-vector-search/ says : “You must ensure that the data nodes have enough RAM to hold the vector data and indexes.” . So I assume that “the vector data” means n_documents*embeddings_dim; where can I see the index size? Is it the size indicated under the “Atlas search” tab (5Gb in the screenshot below), or does that also include the vector data?

image2244×65 14.4 KB

The size indicated there is the amount of memory you would need to performantly index and serve queries for vector search indexes. There is more detail in the docs as to how to roughly size this by the size of the vector, with metadata taking up some nominal amount of additional space.

Assuming that the index holds in RAM: how does the performance scale with the amount of vectors? Is it ~O(1)? O(log n)? Something else?

Search query time should be roughly O(log n) as you scale up the number of vectors, though this may vary depending on the type of filtering you do.

The docs recommend pre-filtering the data to improve performance. How is the performance scale w.r.t. amount of vectors/amount matched by the filter? Say we have 1M vectors and our filter matches 20k of them. Will the vector search performance be the same as searching an index with only 20k vectors?

Highly selective filters tend to perform better than less restrictive filters since it may trigger an exact search to be performed on a small subset of vectors. In the scenario you described it would depend on how many candidates you are requesting, but if it is very likely that those two searches would have similar performance.

Is there a limit to how many “partitions” we can have on the vectors for filtering? Say we have a field filter_field which we use to split embeddings in N groups which we use to pre-filter the embeddings before doing the vector search. Can N be arbitrarily large without increasing the

There is no limit to the cardinality of a filter field. I would actually expect very high cardinality filters in most cases to perform better than low cardinality filters.

Our embeddings are all in a single collection and are partitioned into N buckets with a field (say bucket_id). We typically search within a group of M buckets by using the pre-filter "$in": [bucket_id_1, ..., bucket_id_M]. How does M impact the performance of the search? Is this approach suitable for scenarios where M might grow very large (100? 1k? 10K?)?

Similar to the previous answers, the more selective the filter is the more likely you are to do an exact search on a very small subset of vectors, so this approach is suitable for scenarios where M might grow very large.

In case the answer to the previous question is that M should stay relatively small, here is the problem we are trying to solve: our current scenario (simplifying a bit) is that we partition data per-customer, where a customer might have a few “buckets” of embeddings (actually, each bucket is a set of text documents that are then split and embedded). So when doing a search for customer X, we get the list of the buckets for that customer that are relevant to the search filter embeddings belonging to that bucket using “$in”. We’re scaling up now and are facing the issue that some customers have massive amounts of buckets where the majority of text document (and thus embeddings) are duplicated in many of the buckets, leading to massive amounts of duplicate embeddings.

Can you help me understand a bit more why there is data duplication? From how you had described it it sounded like a pretty standard multi-tenanted architecture.

Luca_Dorigo · June 14, 2024, 4:47am

I meant to remove this one, I rephrased it in the question below. Unfortunately it doesn’t seem possible to edit a post. Ignore this part.

Luca_Dorigo · July 1, 2024, 9:54am

Hi Henry, thank you for your answer and sorry for the latency in answering, I did not get a notification for your reply!

Let me give some more details of our actual usecase: indeed the general setup is multi-tenant - in our case, our topmost “tenants” (our customers) are educational institutions. However; for each tenant, we have a further subdivision into user groups - in practice, this generally corresponds to a specific instance of a course given at the institution.

At the tenant level, there is indeed little to no duplication; however, we found that at the course level, the vast majority of documents are heavily duplicated across courses (in particular because some institutions tend to run many “copies” of the same course in parallel, with most but not all content shared amongst them). To give an idea, we found that the vast majority (95%+) of course-specific documents are duplicated in up to 50 different courses.

The “dumb” solution for now was to just chunk/embed each of those files 50 times - which will obviously not scale well at all

The easy solution is what I described above - rather than use a filter to get all documents/chunks associated with a course (so the filter would be chunk.document.course_id == XXX ), we would store each document as a unique object and the chunks (which are embedded) would point to the document to which they belong; for each course, we would then store a list of documents, and the queries would use a filter like chunk.document_id in course.document_ids . My question is whether the performance here would be reasonable, since course.document_ids might contain several thousand elements, and I’m not sure whether the vector search index is optimized for this type of queries.

Otherwise, we would have to write logic to factor out the “shared documents” and identify which courses use which sets of shared documents, but that’s significantly more work and housekeeping, so I only want to do this if I’m sure the “easy solution” is not enough.

Henry_Weller · August 7, 2024, 9:29pm

@Luca_Dorigo I believe you should see good performance with a filter of that cardinality. Have you run any performance tests suggesting otherwise?