I have a test database of documents with an average document size of only 32942 bytes.
I am getting a size-exceeded error on the very simple query below, against a small database (~600 MB). Code:
for rec in resultSet:
    reccnt = reccnt + 1

Error:

pymongo.errors.DocumentTooLarge: BSON document too large (50853059 bytes) - the connected server supports BSON document sizes up to 16777216 bytes.
So, to avoid a $lookup, which is done entirely with one access to the server, you are
1 - doing a find that downloads the list of ids to look up / join on
2 - doing a second access to the server, uploading the list of ids you got in step 1, to find the matching documents
So you basically implement your own $lookup in a less efficient way: more accesses to the server, more I/O between the client and the server, and more CPU on the client, which in principle is less powerful than the server.
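A minimal pymongo sketch of those two steps, assuming hypothetical collection names (first_coll, second_coll) and field names (status, refs) since the actual schema was not posted:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["test"]

# Step 1: one round trip to download the ids to join on.
ids = set()
for doc in db.first_coll.find({"status": "active"}, {"refs": 1}):
    ids.update(doc.get("refs", []))

# Step 2: a second round trip that uploads the whole id list back to the
# server. If the list is very large, the query document itself can exceed
# the 16 MB BSON limit and trigger DocumentTooLarge on the client side.
resultSet = db.second_coll.find({"_id": {"$in": list(ids)}})
reccnt = 0
for rec in resultSet:
    reccnt = reccnt + 1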
I do not know how accurate your Python environment is in terms of showing where the error line is, but I suspect it is wrong. As far as I know, resultSet is a cursor, so I am pretty confident that pymongo will return a valid cursor object. And each record is a document stored on the server, so I don't see how any single rec from resultSet could be too big.
I think $lookup would be less efficient. Please correct me if I am wrong.
The query here is to filter records from the first collection, collect reference IDs from an array in those documents, and then search the second collection for documents with these reference IDs.
The two collections used here may have different distributions across the shards, so there will be a lot of network I/O between shards to perform the $lookup stage.
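For comparison, a rough sketch of the single-round-trip $lookup version of the same join, using the same placeholder names as above:

# One aggregation: the server resolves the references itself, so the id
# list never travels back to the client.
pipeline = [
    {"$match": {"status": "active"}},
    {"$lookup": {
        "from": "second_coll",
        "localField": "refs",
        "foreignField": "_id",
        "as": "ref_docs",
    }},
]
resultSet = db.first_coll.aggregate(pipeline)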
On the other side, two queries are used to perform this task. The first query finds the reference IDs from the first collection; it runs in parallel on all applicable shards with an almost balanced workload, and only the IDs are returned. Those IDs are then used to filter the second collection, again with a balanced workload. No document transfer among cluster nodes is required except for the results.
I could, but I will not, because I do not personally have the resources to test it, and I will not use my customers' resources to test it.
To test whether the "BSON document too large" error comes from the size of the query, as I think, rather than from the processing of the result set, you can try to insert the query (using the same list of ids) into a temporary collection instead of calling find.
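Continuing the placeholder names from the sketches above, such a test could look roughly like this:

import pymongo.errors

# Insert a document carrying the same id list that the failing find used.
# If this insert raises DocumentTooLarge, the id list alone is past the
# 16 MB limit, i.e. the query document (not the result set) is the problem.
try:
    db.temp_queries.insert_one({"ids": list(ids)})
except pymongo.errors.DocumentTooLarge as exc:
    print("query/id list itself exceeds the BSON limit:", exc)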