Index Creation for Vector Search

Hello,

I have some questions related to what happens “in the shadows” when creating and index for vector searching in MongoDB Atlas Vector Search.

How are HNSW graphs constructed when generating a Vector Search index? What values are typically chosen for mL, efConstruction, M, Mmax, or Mmax0? Is there a process to determine optimal values for each index being generated, or are there static pre-established values?

Thank you!

Hi Clara,

Thank you for the question! I think there are two pieces to the answer based on section 4.1 from the original HNSW paper. TL; DR there are sane defaults for this set within Lucene that we rely on, but it may be beneficial for us to explore exposing a subset of the ones you’ve described for certain use cases.

  1. Primary construction parameters that determine others

The paper specifies that certain construction params are dependent on others, and an optimal configuration requires defining a relationship between them, and only tweaking the independent variable. These primaries are as follows (also indicated in lucene docs):

  • M (maxConn in lucene)
  • efConstruction (beamWidth in lucene)

The formulas for computing secondary variables are also defined:

  • mL = 1/ln(M)
  • Mmax0 = 2 * M
  • Mmax = M (note that this specifically isn’t specified in the paper as far as i can tell, but this is the sane default set in FAISS that the industry seems to be running with)
  1. Sane defaults for primary construction parameters

The paper specifies a reasonable range of M is from 5 to 48. This parameter is the one that would likely most benefit from being exposed, since the default of 16 is shown to work well for general use cases, but may need to be scaled up as the number of vectors increases to not see recall degrade. I believe optimizing for the 80% most common vector use cases where this makes sense is sensible, but could imagine seeing performance improvements if you were to bump this up to say 32 from the default.

The paper suggests a value of 100 for efConstruction, with minimal performance improvement beyond that. Note that this latter bit provides improved recall at the expense of indexing time, so there may be a benefit if you have a high throughput indexing workload to reducing this at the cost of accuracy. This seems less beneficial generically to expose compared to the first parameter, as query performance typically is more of a concern than indexing time for the majority of use cases, but there are definitely exceptions to this (e.g. near-real time vector search indexing).

You can find the Lucene defaults for these of 16 and 100 here.

In general I would love to learn more about what experimenting with different values for your use case could look like, but it doesn’t seem like its as urgent of a problem given the sane defaults of 16, 100 for the vast majority of folks. My suggestion if you’re still curious would be to experiment with a different system that does expose these and see if you notice any tangible benefit when holding other query params constant.

Hope this helps!

Hi Henry,

Thanks a lot for your detailed response! Your insights on the construction parameters of the graphs and their default values were really helpful. I now have a clearer understanding of how things work and I trust that the default values used for primary construction parameters (16 for M and 100 for efConstruction) will generally work well for any use case.

I appreciate your time and guidance!