How to sample unique documents in a collection

Hi Team,

As part of sampling, I have 10000 documents whose fields and data types differ from one document to the next. How do I find the unique documents along with their data types? Can this be done in Compass, or is there a MongoDB query for it?

Example:

db.collection_001.drop();
db.createCollection("collection_View_Aggregation_042");
db.collection_001.insert(
{
   "_id" : 1,
   "item" : "",
   "price" :
});

db.collection_001.insert(
{
   "_id" : ,
   "item" : "",
   "price" : ,
   "quantity" :
});

db.collection_001.insert(
{
   "_id" : ,
   "item" : "",
   "price" : ,
   "quantity" : ,
   "date" : ISODate("2014-01-01T08:15:39.736Z")
});

Once this grows to 10000 documents with different data types, finding the unique documents in the collection becomes very difficult.

How do I fetch the unique document names along with their data types? Is there a shortcut available?

Thanks & Regards,
Jay

To be precise, I have 1000 documents in one collection.
Document 1 - 10 different field names (_id, name, dob, address, pin_code, Co-ordinates, etc.)
Document 2 - Only 3 field names (_id, Name, Dob)
Document 3 - Contains 45 field names
Document 4 - Contains 150 field names
Document 5 - Contains 250 field names
Document 6 - Contains 200 field names
.
.
.
and so on.

Field names may be the same or may vary from one document to another. This is what is called “Sampling” in MongoDB.

If I want to find every field name that exists in the collection, I have to check all 1000 records manually. How do I find all fields in the collection, irrespective of document?

Hello, @Jayaprakash_Nagappan!

The required result can be achieved mainly by combining the $objectToArray operator with the $group stage.
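
To make the idea concrete before looking at the pipelines, here is a minimal plain-JavaScript sketch (Node.js, no MongoDB involved) of what the combination does: each document is turned into a list of { k, v } pairs, and the distinct keys are collected. The function names and the in-memory docs array are hypothetical and only used for illustration; the real work should be done by the server-side aggregation.

```javascript
// Plain JavaScript stand-in for $objectToArray: one { k, v } pair per field.
function objectToArray(doc) {
  return Object.entries(doc).map(([k, v]) => ({ k, v }));
}

// Stand-in for $unwind + $group with $addToSet: collect every distinct key.
function uniquePropNames(docs) {
  const names = new Set();
  for (const doc of docs) {
    for (const { k } of objectToArray(doc)) {
      names.add(k);
    }
  }
  return [...names];
}

// Hypothetical in-memory documents, not read from a real collection.
const docs = [
  { _id: 1, propA: 1 },
  { _id: 2, propA: '1', propB: true },
  { _id: 3, propA: 1, propC: [1, 2, 3] },
];

console.log(uniquePropNames(docs)); // [ '_id', 'propA', 'propB', 'propC' ]
```

The aggregation pipelines do the same thing on the server, so the documents never have to be pulled into the client.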

Keep in mind when using the examples below that you may hit the 16MB BSON document size limit if there is a large variety of prop names, or if some prop names are present in most of the documents of a huge collection. This is because $group accumulates the collected values into arrays inside a single result document.

For the purpose of the example, let’s simplify your document structure to this:

db.test1.insertMany([
  { propA: 1 },
  { propA: '1', propB: true },
  { propA: 1, propC: [1, 2, 3] }
]);

Case 1: You need to get the list of unique names of properties of all documents in your collection.

Expected output:

{
  "_id" : null,
  "uniquePropNames" : [
    "propB",
    "propA",
    "propC",
    "_id"
  ]
}

Aggregation:

db.test1.aggregate([
  {
    $addFields: {
      list: {
        $objectToArray: '$$CURRENT',
      }
    }
  },
  {
    $unwind: '$list',
  },
  {
    $group: {
      _id: null,
      uniquePropNames: {
        $addToSet: '$list.k'
      },
    }
  }
]).pretty();
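
If the single result document above risks growing too large, one possible variation (a sketch, not part of the original answer) is to group by the property name itself, so that every name becomes its own small result document. The pipeline name, the documentCount field, and the sample size of 1000 are assumptions chosen for illustration. The optional $sample stage restricts the work to a random subset of documents, at the cost of possibly missing rarely used fields.

```javascript
// Sketch: one result document per property name, so no single document
// has to hold the whole list of names.
const uniquePropNamesPipeline = [
  // Assumption: 1000 is an arbitrary sample size; remove this stage to scan everything.
  { $sample: { size: 1000 } },
  // Turn every document into an array of { k, v } pairs.
  { $project: { list: { $objectToArray: '$$ROOT' } } },
  { $unwind: '$list' },
  // Grouping by the key yields one small document per distinct field name,
  // with a count of how many sampled documents contain it.
  { $group: { _id: '$list.k', documentCount: { $sum: 1 } } },
  { $sort: { _id: 1 } },
];

// In the mongo shell: db.test1.aggregate(uniquePropNamesPipeline)
```

Because the pipeline is a plain array, it can be built and inspected before it is passed to aggregate().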

Case 2: You need the list of unique property names across all documents in your collection, and for each property the unique list of types it takes across those documents.

Expected output:

{ "propTypes" : [ "array" ], "propName" : "propC" }
{ "propTypes" : [ "bool" ], "propName" : "propB" }
{ "propTypes" : [ "string", "double" ], "propName" : "propA" }

Aggregation:

db.test1.aggregate([
  {
    $addFields: {
      list: {
        $objectToArray: '$$CURRENT',
      }
    }
  },
  {
    $unwind: '$list',
  },
  {
    $match: {
      'list.k': {
        $ne: '_id',
      }
    }
  },
  {
    $group: {
      _id: '$list.k',
      propTypes: {
        $addToSet: {
          $type: '$list.v',
        }
      }
    }
  },
  {
    $project: {
      _id: false,
      propName: '$_id',
      propTypes: true,
    }
  }
]).pretty();

Case 3: You need the list of unique property names across all documents in your collection, together with the ids of the documents that have each property.

Expected output:

{
  "documentIdsThatHoldProp" : [
    ObjectId("5f02435b9cb1606d755fd1e3")
  ],
  "propName" : "propB"
}
{
  "documentIdsThatHoldProp" : [
    ObjectId("5f02435b9cb1606d755fd1e2"),
    ObjectId("5f02435b9cb1606d755fd1e3"),
    ObjectId("5f02435b9cb1606d755fd1e4")
  ],
  "propName" : "propA"
}
{
  "documentIdsThatHoldProp" : [
    ObjectId("5f02435b9cb1606d755fd1e4")
  ],
  "propName" : "propC"
}

Aggregation:

db.test1.aggregate([
  {
    $addFields: {
      list: {
        $objectToArray: '$$CURRENT',
      }
    }
  },
  {
    $unwind: '$list',
  },
  {
    $match: {
      'list.k': {
        $ne: '_id',
      }
    }
  },
  {
    // Group directly by property name. Keys are unique within a document,
    // so pushing '$_id' cannot produce duplicates, and this avoids first
    // collecting every (name, id) pair into one large document.
    $group: {
      _id: '$list.k',
      documentIdsThatHoldProp: {
        $push: '$_id',
      }
    }
  },
  {
    $project: {
      _id: false,
      propName: '$_id',
      documentIdsThatHoldProp: true,
    }
  }
]).pretty();

Case 4: You need the list of unique property names across all documents in your collection, together with every type each property can have, mapped to the ids of the documents that hold a value of that type.

Expected output:

{
  "variations" : [
    {
      "propType" : "string",
      "documentIds" : [
        ObjectId("5f02435b9cb1606d755fd1e3")
      ]
    },
    {
      "propType" : "double",
      "documentIds" : [
        ObjectId("5f02435b9cb1606d755fd1e2"),
        ObjectId("5f02435b9cb1606d755fd1e4")
      ]
    }
  ],
  "propName" : "propA"
}
{
  "variations" : [
    {
      "propType" : "array",
      "documentIds" : [
        ObjectId("5f02435b9cb1606d755fd1e4")
      ]
    }
  ],
  "propName" : "propC"
}
{
  "variations" : [
    {
      "propType" : "bool",
      "documentIds" : [
        ObjectId("5f02435b9cb1606d755fd1e3")
      ]
    }
  ],
  "propName" : "propB"
}

Aggregation:

db.test1.aggregate([
  {
    $addFields: {
      list: {
        $objectToArray: '$$CURRENT',
      }
    }
  },
  {
    $unwind: '$list',
  },
  {
    $match: {
      'list.k': {
        $ne: '_id',
      }
    }
  },
  {
    $group: {
      _id: {
        propName: '$list.k',
        propType: {
          $type: '$list.v',
        }
      },
      documentIds: {
        $push: '$_id',
      }
    }
  },
  {
    $group: {
      _id: '$_id.propName',
      variations: {
        $push: {
          propType: '$_id.propType',
          documentIds: '$documentIds'
        }
      }
    }
  },
  {
    $project: {
      _id: false,
      propName: '$_id',
      variations: true,
    }
  }
]).pretty();
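
For a large collection, the arrays of document ids in Case 4 can become very long. As a hedged alternative, the sketch below keeps the same grouping but replaces the id lists with a count per type. The names propTypeCountsPipeline and documentCount are hypothetical and do not appear in the examples above.

```javascript
// Sketch: same shape as Case 4, but each variation carries a count
// instead of the full list of document ids.
const propTypeCountsPipeline = [
  { $project: { list: { $objectToArray: '$$ROOT' } } },
  { $unwind: '$list' },
  // Skip the _id field, as in the other cases.
  { $match: { 'list.k': { $ne: '_id' } } },
  {
    // One group per (property name, BSON type) pair.
    $group: {
      _id: { propName: '$list.k', propType: { $type: '$list.v' } },
      documentCount: { $sum: 1 },
    },
  },
  {
    // Fold the per-type counts back into one document per property.
    $group: {
      _id: '$_id.propName',
      variations: {
        $push: { propType: '$_id.propType', documentCount: '$documentCount' },
      },
    },
  },
  { $project: { _id: false, propName: '$_id', variations: true } },
];

// In the mongo shell: db.test1.aggregate(propTypeCountsPipeline)
```

Each result document then lists, for one property, how many documents hold a value of each type, which is usually enough to spot fields with inconsistent types.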