Statistics on a collection by the latest value of a field

Hello all,

I have a collection stores the user login record. It contains the user_id, login_time and app_version, the data looks like:

userID_a, 2022-04-08T00:00:15Z, "1.0"
userID_b, 2022-04-08T00:10:15Z, "1.1"
userID_a, 2022-04-09T01:00:15Z, "1.0"
userID_b, 2022-04-10T00:10:15Z, "1.2"
userID_a, 2022-04-09T10:00:15Z, "1.1"

I want to calculate the percentage of each version of application used by user. Here the final result should be “1.1”: 50%, “1.2” :50%, because userID_A and userID_b seems upgrade their app version, and 1.0 not used by any user.

So I want to distinct the collection by userID field, but keep the latest document of each userID, then calculate the percentage of each value in app version.

How to create the index for that purpose? How to write the aggregation or query for such calculation?

Thanks a lot ,

James

Hi @Zhihong_GUO,

Here is my solution in Python:

from datetime import datetime

from pymongo import MongoClient, ASCENDING, DESCENDING

client = MongoClient()
db = client.get_database('test')
logins = db.get_collection('logins')
logins.drop()
logins.create_index([('user_id', ASCENDING), ('date', DESCENDING), ('v', ASCENDING)])
logins.insert_many([
    {'user_id': 'userID_a', 'date': datetime.strptime('2022-04-08T00:00:15Z', '%Y-%m-%dT%H:%M:%SZ'), 'v': '1.0'},
    {'user_id': 'userID_b', 'date': datetime.strptime('2022-04-08T00:10:15Z', '%Y-%m-%dT%H:%M:%SZ'), 'v': '1.1'},
    {'user_id': 'userID_a', 'date': datetime.strptime('2022-04-09T01:00:15Z', '%Y-%m-%dT%H:%M:%SZ'), 'v': '1.0'},
    {'user_id': 'userID_b', 'date': datetime.strptime('2022-04-10T00:10:15Z', '%Y-%m-%dT%H:%M:%SZ'), 'v': '1.2'},
    {'user_id': 'userID_a', 'date': datetime.strptime('2022-04-09T10:00:15Z', '%Y-%m-%dT%H:%M:%SZ'), 'v': '1.1'},
    {'user_id': 'userID_c', 'date': datetime.strptime('2022-04-09T10:00:15Z', '%Y-%m-%dT%H:%M:%SZ'), 'v': '1.1'}
])
pipeline = [
    {
        '$sort': {
            'user_id': 1,
            'date': -1
        }
    }, {
        '$group': {
            '_id': '$user_id',
            'date': {
                '$first': '$date'
            },
            'v': {
                '$first': '$v'
            }
        }
    }, {
        '$group': {
            '_id': '$v',
            'count': {
                '$sum': 1
            }
        }
    }, {
        '$group': {
            '_id': None,
            'total': {
                '$sum': '$count'
            },
            'data': {
                '$push': '$$ROOT'
            }
        }
    }, {
        '$unwind': {
            'path': '$data'
        }
    }, {
        '$project': {
            '_id': 0,
            'v': '$data._id',
            'count': '$data.count',
            'percentage': {
                '$multiply': [
                    100, {
                        '$divide': [
                            '$data.count', '$total'
                        ]
                    }
                ]
            }
        }
    }
]
results = logins.aggregate(pipeline)
for r in results:
    print(r)

Results (I added an extra user_id ‘userID_c’ so it’s not 50/50):

{'v': '1.2', 'count': 1, 'percentage': 33.33333333333333}
{'v': '1.1', 'count': 2, 'percentage': 66.66666666666666}

Here is my aggregation pipeline:

[
  {
    '$sort': {
      'user_id': 1, 
      'date': -1
    }
  }, {
    '$group': {
      '_id': '$user_id', 
      'date': {
        '$first': '$date'
      }, 
      'v': {
        '$first': '$v'
      }
    }
  }, {
    '$group': {
      '_id': '$v', 
      'count': {
        '$sum': 1
      }
    }
  }, {
    '$group': {
      '_id': null, 
      'total': {
        '$sum': '$count'
      }, 
      'data': {
        '$push': '$$ROOT'
      }
    }
  }, {
    '$unwind': {
      'path': '$data'
    }
  }, {
    '$project': {
      '_id': 0, 
      'v': '$data._id', 
      'count': '$data.count', 
      'percentage': {
        '$multiply': [
          100, {
            '$divide': [
              '$data.count', '$total'
            ]
          }
        ]
      }
    }
  }
]

It looks like this in MongoDB Compass with the step by step on the side:

EDIT: Oh and note that I’m using the index {user_id: 1, date: -1, v: 1} which should support this query according to this doc.

Cheers,
Maxime.

@MaBeuLux88_xxx ,

Hello Maxime,

Thank you soooooo much for the detail information! I learned a lot from your answer. Really appreciated!

James

1 Like