Is this query possible - % of orders in April 2020 that were from new customers

Matt_2x · May 22, 2020, 4:51am

Imagine a simple collection where each record has four primitive attributes - OrderId, CustomerEmail, OrderDate, ProductOrdered.

A business owner asks the question:
What percentage of orders in April 2020 were from new customers?

Important to note:
Same customer for this purpose is where the email is the same.
If your solution uses count <=1 be aware that a customer that orders twice in April is also a a new customer.

For stage 1 the output can be something like this:
New customers April: 13,343
Repeat customers (has at least 1 order prior to 1st april): 3,423

I guess converting a percentage isn’t technically necessary at this stage… but bonus point if done…

I can do this in SQL using CTEs, but we’re looking at ways of allowing stakeholders to model their own queries and sql builders based around string concatatention aren’t appealing, I’m actually looking at query DSLs and how appriorate they are to modelling these types of questions - with the possible view of sticking some UX report builder on top one day.

So back on point, can the Mongo query language model the above question?

Thanks

Prasad_Saya · May 22, 2020, 8:38am

db.collection.aggregate([
  { 
      $group: { 
          _id: "$CustomerEmail", 
          count_new:  { $sum: { $cond: [ { $gte: [ "$OrderDate", ISODate("2020-04-01") ] }, 1, 0 ] } },
          count_repeat:  { $sum: { $cond: [ { $lt: [ "$OrderDate", ISODate("2020-04-01") ] }, 1, 0 ] } }
      } 
  },
  { 
      $group: { 
          _id: "Customers:",
          new_custs: { $sum: { $cond: [ { $eq: [ "$count_repeat", 0 ] }, 1, 0 ] } },
          repeat_custs: { $sum: { $cond: [ { $gt: [ "$count_repeat", 0 ] }, 1, 0 ] } }
      } 
  }
])

The output:

{ "_id" : "Customers:", "new_custs" : 1, "repeat_custs" : 2 }

Matt_2x · May 22, 2020, 9:09am

Prasad,

[Note: EDITED. I did indeed say the 100% wrong thing prior to this edit, my bad, apologises]
repeat_customers are customers that ordered prior to april and in april.
new_customers are customers that order in april, but have not ordered prior.

To be clearer lets try question “What amount of orders in April 2020 were from new customers and what amount from repeat customers?”

For stage 1 the output can be something like this:
New customers April (has no orders prior to 1st april): 13,343
Repeat customers (ordered in april, has at least 1 order prior to 1st april): 3,423

Thanks

Prasad_Saya · May 22, 2020, 9:13am

Hi Matt Freeman,

What percentage of orders in April 2020 were from new customers?

The following aggregation returns the percentage:

db.collection.aggregate([
  { 
      $group: { 
          _id: "$CustomerEmail",
          count_new:  { $sum: { $cond: [ { $gte: [ "$OrderDate", ISODate("2020-04-01") ] }, 1, 0 ] } },
          count_repeat:  { $sum: { $cond: [ { $lt: [ "$OrderDate", ISODate("2020-04-01") ] }, 1, 0 ] } },
      } 
  },
  { 
      $group: {
          _id: null,
          total_april: { $sum: "$count_new" },
          total_new_custs: { $sum: { $cond: [ { $eq: [ "$count_repeat", 0 ] }, "$count_new", 0 ] } },
          total_repeat_custs: { $sum: { $cond: [ { $gt: [ "$count_repeat", 0 ] }, "$count_new", 0 ] } },
      }
  },
  {
       $project: {
             _id: 0,
             new_cust_orders_percentage: { $divide: [ { $multiply: [ "$total_new_custs", 100 ] }, "$total_april" ] }
       }
  }
])

The output:

{ "new_cust_orders_percentage" : 66.66666666666667 }

I used these as sample documents:

{ OrderId: 1, CustomerEmail: "e-1", OrderDate: ISODate("2020-05-22"), ProductOrdered: "p-1" }
{ OrderId: 2, CustomerEmail: "e-2", OrderDate: ISODate("2020-03-25"), ProductOrdered: "p-91" }
{ OrderId: 3, CustomerEmail: "e-2", OrderDate: ISODate("2020-04-25"), ProductOrdered: "p-90" }
{ OrderId: 4, CustomerEmail: "e-1", OrderDate: ISODate("2020-05-20"), ProductOrdered: "p-0" }
{ OrderId: 5, CustomerEmail: "e-3", OrderDate: ISODate("2020-02-01"), ProductOrdered: "p-66" }

Matt_2x · May 22, 2020, 10:05am

Sorry I am making a real mess of explaining this,

We are interested in the order stats for the month April 2020.

A new customer is a customer that ordered in April 2020 but has not ordered previously before April 2020.
A repeat customer is a customer that ordered in April 2020 and has ordered at some point previous to April 2020.

A customer should only be counted once.

Matt_2x · May 22, 2020, 9:04pm

@Parasad.

I think this looks right now, thank you very much, greatly appreciated.

So it seems mongo query language is suited to representing complicated questions, I struggled to model this with (Incomplete) MBQL Reference · metabase/metabase Wiki · GitHub or other non-SQL-string query DSLs

At this point I should be completely honest and say that I may not be intending to use the query language against Mongo, but to just borrow the concept and apply it to our non-mongo datastore.

Long shot, for inspiration is anyone aware of any other DSLs that serialize neatly to JSON for representing queries of such complexity.

Thanks