Wrong document count with mixed _id type

Vinay_Avasthi2 · May 16, 2023, 6:02pm

I have a collection with mixed _id types: some documents use strings and others use ObjectId. When I load the data with the Spark connector, by default I only see the non-ObjectId documents. To see the ObjectId records I have to explicitly pass a pipeline: { "_id": { "$type": "objectId" } }. I have not been able to find a way to query all the documents.

Is there a known solution to this problem?

Prakul_Agarwal · May 31, 2023, 7:57pm

Hi @Vinay_Avasthi2,

Have you tried using $exists?

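import json

# The "pipeline" option takes a JSON string, so serialize the Python list
# (json.dumps also turns Python's True into JSON's true).
pipeline = json.dumps([
    {"$match": {"_id": {"$exists": True}}}
])

df = spark.read.format("mongo").option("pipeline", pipeline).load()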

Vinay_Avasthi2 · June 1, 2023, 9:26am

I tried this, but it still gives the wrong count (29863 vs 30605). The only case that works is when I create two separate RDDs, one with Aggregates.match(Filters.type("_id", "objectId")) and one with Aggregates.match(Filters.not(Filters.type("_id", "objectId"))), and union them. But this seems expensive compared to creating a single plain RDD.
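
For anyone wanting to try the same two-read workaround from PySpark, here is a minimal sketch. It assumes the same "mongo" format and "pipeline" option used earlier in the thread, and an existing SparkSession passed in as spark. The function name load_all_documents and the two constants are hypothetical, made up for illustration. Casting _id to string in both halves is also an assumption, added so the two DataFrames have a compatible _id column before the union; how ObjectId values render after the cast depends on how the connector infers them.

```python
import json

# Two complementary filters on the BSON type of _id (mirrors the
# Filters.type / Filters.not(Filters.type) pair described above).
OBJECT_ID_ONLY = json.dumps([{"$match": {"_id": {"$type": "objectId"}}}])
NON_OBJECT_ID = json.dumps([{"$match": {"_id": {"$not": {"$type": "objectId"}}}}])


def load_all_documents(spark):
    """Read each _id type separately, then union the two results.

    `spark` is assumed to be an existing SparkSession already configured
    with the MongoDB connection settings.
    """
    def read(pipeline):
        df = spark.read.format("mongo").option("pipeline", pipeline).load()
        # Assumption: normalize _id to a string so both halves share a type.
        return df.withColumn("_id", df["_id"].cast("string"))

    object_ids = read(OBJECT_ID_ONLY)
    others = read(NON_OBJECT_ID)
    # allowMissingColumns needs Spark 3.1+; it fills columns that were
    # inferred in only one of the two reads with nulls.
    return object_ids.unionByName(others, allowMissingColumns=True)
```

Calling load_all_documents(spark).count() should then be comparable with the collection's document count. This still issues two reads against the collection, so it does not address the cost concern raised above.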