Hello Aytaç,
I created a ticket to track this work here. I’d expect us to take it on when we do a general push for in-database distributed machine learning along with other popular algorithms.
I am guessing you’d like to do this in the database either because you want to reduce the data before returning it to the client (e.g. return per-cluster summaries instead of each individual customer) or because you expect performance gains from running in the database, even though you still plan on returning a list of all the customers. Also, do you intend to run clustering across all users (making it possible to run it once and use the result multiple times) or on different subsets (e.g. viewer A could be looking at customers in Region X while viewer B is looking at Region Y, and you want to rerun clustering just for X or Y)? Would you mind clarifying?
In the meantime, have you considered precomputing centroids using k-means outside MongoDB (e.g. in R or Python), then computing the squared Euclidean distances between the points and the centroids in MQL (with the centroid values from R/Python hardcoded into the query), and finally assigning each point to the nearest centroid? That would be a non-trivial MQL query, but it might help. If you’re rendering all customers, it might be easier to do this in JavaScript after retrieving the data from MongoDB. Since nearest-centroid assignment can be done in a single pass, whereas k-means takes numerous iterations, that would cut your run time by a few orders of magnitude at least. It would also make the results deterministic, which you wouldn’t get from k-means with a random seed. And unless there is a significant change in user count, the centroids wouldn’t move much, so you would still get meaningful results.
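To make the idea above a bit more concrete, here is a minimal Python sketch, not a definitive implementation. It assumes each customer document has two numeric fields named `x` and `y`, and that three centroids were already exported from a k-means run; the field names, the centroid values, and the helper names are all hypothetical. The first function does the single-pass assignment on the client side, and the second builds an aggregation pipeline that should do the same inside MongoDB using standard operators (`$addFields`, `$subtract`, `$pow`, `$add`, `$min`, `$indexOfArray`).

```python
from math import inf

# Hypothetical centroids, e.g. copied from a k-means run in R or Python.
CENTROIDS = [(1.0, 1.0), (5.0, 5.0), (9.0, 1.0)]
# Hypothetical numeric field names on each customer document.
FIELDS = ("x", "y")


def nearest_centroid(point, centroids=CENTROIDS):
    """Return the index of the centroid with the smallest squared Euclidean distance."""
    best_index, best_dist = -1, inf
    for i, centroid in enumerate(centroids):
        dist = sum((p - c) ** 2 for p, c in zip(point, centroid))
        if dist < best_dist:
            best_index, best_dist = i, dist
    return best_index


def assignment_pipeline(centroids=CENTROIDS, fields=FIELDS):
    """Build aggregation stages that tag each document with its nearest centroid index."""
    # One squared-distance expression per centroid, with centroid values hardcoded.
    distances = [
        {"$add": [
            {"$pow": [{"$subtract": [f"${field}", centroid[i]]}, 2]}
            for i, field in enumerate(fields)
        ]}
        for centroid in centroids
    ]
    return [
        # Temporary array holding the distance to every centroid.
        {"$addFields": {"_dists": distances}},
        # The cluster is the position of the smallest distance in that array.
        {"$addFields": {"cluster": {"$indexOfArray": ["$_dists", {"$min": "$_dists"}]}}},
        # Drop the temporary array from the output.
        {"$project": {"_dists": 0}},
    ]


if __name__ == "__main__":
    print(nearest_centroid((6.0, 4.0)))
    print(assignment_pipeline()[0])
```

With PyMongo, the stages could be passed to `collection.aggregate(assignment_pipeline())`, and a `$group` on `cluster` could be appended if you only need per-cluster summaries. If you end up rendering every customer anyway, the same logic as `nearest_centroid` ports directly to client-side JavaScript.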
Regards,
Bora