Embedding retrieval is super slow

I am currently storing 1536-dimensional embeddings under an "embeddings" field in my documents. I want to quickly filter the documents and retrieve all of their embeddings (around a few hundred), but this is very slow: a cold query takes ~50 s. Is there any way to speed this up, or is this just the nature of retrieving a lot of embeddings at 1536 dims each?

I am retrieving them for visualization purposes, so I will be performing dimensionality reduction with t-SNE after retrieval. Is MongoDB able to do the dimensionality reduction itself?
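For reference, here is roughly the shape of my current code; the connection URI, database/collection names, and filter are placeholders:

```python
# Sketch of the retrieval + t-SNE workflow described above.
# URI, names, and filter are placeholders, not my real values.
import numpy as np
from pymongo import MongoClient
from sklearn.manifold import TSNE

client = MongoClient("mongodb+srv://<cluster-uri>")
coll = client["mydb"]["docs"]

# Project only the embedding field so whole documents aren't
# pulled over the wire.
cursor = coll.find({"label": "some-filter"}, {"embeddings": 1, "_id": 0})
vectors = np.array([doc["embeddings"] for doc in cursor])  # shape (n, 1536)

# Reduce to 2D for plotting.
coords = TSNE(n_components=2).fit_transform(vectors)
```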

I am using the free M0 cluster right now.

Hi @Acadia_AI!

Thank you for the question. We are in the process of releasing a docs page on performance-tuning considerations, but I should be able to answer this before it goes out, and I will link to it here once it is released.

I suspect the problem you're seeing has less to do with dimensionality and more to do with going to disk for vector retrieval. The HNSW indexes we use for vector retrieval are very good for CRUD operations and for low-latency, accurate approximate nearest neighbor retrieval, but they require random-access reads of the vector values on disk for each vector traversed in the graph. Random-access reads are served much, much more quickly from memory than from disk, so to ensure subsecond query latency you want enough memory available on your cluster for the filesystem cache to hold those vectors.

A 1536-dimensional vector stored as doubles should occupy about 12 KB of memory (1536 dimensions × 8 bytes), and roughly 50% of the available memory on an M0 cluster should be available as filesystem cache to serve those vectors. After a brief warmup period with some representative queries, you will see in your cluster metrics that disk read IOPS drops to 0 once the vectors have been loaded into cache, and future vector reads should all be served from memory.
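As a rough illustration, a warmup could just run a handful of queries against the vector index so the vectors get paged into cache. This is a sketch only: the index name, path, and collection are assumptions, and in practice you would use query vectors representative of your actual workload rather than random ones.

```python
# Hypothetical cache-warmup sketch: issue a few $vectorSearch
# queries so traversed vectors are paged into the filesystem cache.
# Index name, path, and collection names are assumptions.
import random
from pymongo import MongoClient

client = MongoClient("mongodb+srv://<cluster-uri>")
coll = client["mydb"]["docs"]

for _ in range(10):
    # Random vectors stand in for representative query vectors here.
    query_vector = [random.uniform(-1, 1) for _ in range(1536)]
    list(coll.aggregate([
        {"$vectorSearch": {
            "index": "vector_index",   # assumed index name
            "path": "embeddings",
            "queryVector": query_vector,
            "numCandidates": 100,
            "limit": 10,
        }}
    ]))
```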

Dedicated Search Nodes, the recommended way of running our system in production, automatically page vectors into memory, so you shouldn't have to prewarm the cache in the same way, and you should see subsecond performance on all queries provided enough memory is available. In that setting, 80% of the node's RAM can be used for vectors, and it should be available at lower cost using the low-CPU option.