Lab - $facets -> What is it about?

After spending some time on this:

Chapter 4: Core Aggregation - Multidimensional Grouping
Lab - $facets

I did quite a bit of trial and errors to come to the conclusion that the following pipeline should work, giving us the base information needed along with the final solution. And looking at how the problem is formulated, I still think it is the way to go. It also complies to the reformulation found here.

db.movies.aggregate([
    {$facet: {
      imdb:[
          {$match:{"imdb.rating":{$gte:9.3}}},
          {$sort:{"imdb.rating":-1}},
          {$project:{_id:0,title:1,"imdb.rating":1}}
      ],
      metacritic:[
          {$match:{metacritic:{$gte:100}}},
          {$sort:{metacritic:-1}},
          {$project:{_id:0,title:1,metacritic:1}}
      ],
      imdb_metacritic:[
          {$match:{
              "imdb.rating":{$gte:9.3},
              metacritic:{$gte:100}
          }},
          {$project:{_id:0,title:1}}
      ]
    }
  }
]).pretty()

I am aware the number of elements in imdb and metacritic is going slightly over 10, but this is OK because looking at the values for imdb.rating and metacritic there is no way to decide on which element to eliminate. And more importantly the result number of elements in imdb_metacritic is 0.

But this answer is not in the list of choices.

After spending more time, reading some of the many discussion posts on the subject, I see that a number of people agree that this problem is nothing but clear. I have seen some hints, supposedly leading to “the solution”, but since they did not made much sense to me, I didn’t follow on that path.

For example, adding this to the pipeline I described above:

{$match: { "imdb.rating": { $exists: true } }},
{$match: { "metacritic": { $exists: true } }},

Regardless of knowing if it leads to “the solution” or not.

I don’t see on which ground I would be allowed to require:

{$match: { "imdb.rating": { $exists: true } }}

in order to later find the top 10 in the metacritic based classification.

And by the same token, why would one be allowed to require:

{$match: { "metacritic": { $exists: true } }}

in order to later find the top 10 in the imdb.rating based classification.

If someone can make sense of this lab, I hope to get some feedback.

1 Like

Hi Michel_Bouchet,
you need to:

  1. define the input sets for the $facet stage;
  2. extract the first 10 records with the highest imdb.ratings;
  3. extract the first 10 records with the highest metacritic, and
  4. project the records which are common to each of the above extractions.

In fact, the $exists command is not needed here in order to check the existence and then gather the records with highest ratings. You can use the $gte command to do the test right away. Documents not having the imdb.rating and/or metacritic fields would not be considered.

Before using the $facet you must define the input documents or subset of documents that will feed the $facet stage. So you need to specify the range of values for imdb.rating and metacritic in order to satisfy steps 2 and 3. There is no indication of a specific range in the lab statement so naturally you must consider all that is greater than zero.

Having both sets of records defined (steps 2 and 3) you can now use the $facet stage to gather the top 10 records with the highest values for imdb.rating and metacritic.

The last stage of the $aggregation pipeline will be the stage to select (project) the records from both sets of 10 documents having the same movie title.

Hope my explanation can clarify your doubts.
Jaime :coconut: :crab:

Yes, thanks a lot, that clarifies things, I will take another look at the issue with this new light.
But that also makes it clear that the problem is poorly stated from start.

Your important piece of information «Documents not having the imdb.rating and/or metacritic fields would not be considered.» makes it a different problem than what is presented in the lab.
(I presume the meaning here is “imdb.rating and metacritic”, in other words the document should have both to be considered).

You mention that «the $exists command is not needed …can use the $gte command to do the test». But this is not the point, the way the problem is presented, nothing should be needed.

The lab is presenting one problem and expects the answer to another one.

Anyway. Thanks again for your kind help!

I totally agree with you, the problem was poorly stated. I posted a rephrase:

ORIGINAL STATEMENT:
How many movies are in both the top ten highest rated movies according to
the imdb.rating and the metacritic fields? We should get these results with
exactly one access to the database.

REPHRASED:
How many movies the list of the top ten highest imdb.rating and the list
of the top 10 highest metacritic are common to both lists? We should get
these results with exactly one access to the database.

Documents not having the imdb.rating and/or metacritic fields would not be considered by any comparison command ($gt, $gte, $lt, $lte, etc) if the fields doest not exist inside a document, this is the way MongoDB works, this is basic knowledge. To check the existence of a field you use $exists. For the sake of the problem you do not need to consider using $exists because it is implicit that there are many documents having those fields so you may use any comparison command.

:stop_sign: Tip: Always know your data, that is, before coding a solution, do some analysis of the data you will be working on.

This is good to know that we agree, I hope the problem is going to be reformulated at some point. But to be honest and if you allow me to make a comment about your rephrasing I think it is still ambiguous. By two points:

  1. You write «and/or», this is ambiguous. AND and OR are different. Here what you want is clearly AND. In other words the documents not having imdb.rating field AND metacritic field, are not considered. Or if you prefer a different way to say the same thing:
    Documents not having imdb.rating field OR not having metacritic field, are not considered.

  2. You write «movies the list of the top ten highest imdb.rating and the list
    of the top 10 highest metacritic». The problem here is «movies the list of the top ten highest imdb.rating» does not imply anything about having a metacritic field or not; and this is exactly what makes the problem ambiguous. And of course the same thing can be said about «movies in the list of the top 10 highest metacritic» (it does not imply anything about having a imdb.rating field or not).

Here is IMHO, a much less ambiguous formulation:

Taking into account only movies being both imdb-rated and metacritic-rated; how many are in the top ten highest for the imdb.rating and also the top ten highest for the metacritic rating? We should get these results with exactly one access to the database.

2 Likes

Please note that “Documents not having the imdb.rating and/or metacritic fields would not be considered” is not ambiguos in the context of the explanation I was giving, but it is for the purpose of reformulating the problem statement. The $gte command and many others will just not consider the documents that do not have the argument field.

At some point during the processing you will have documents listed with both fields, documents with one field and documents with the other field. The final stage is where you will exract the documents with movies in common for both lists.

Your formulation looks good.

Another point and it should be added to the notes of the lab, is that a secondary sort field is required within each $facet . Without the second sort, you may get an intersection of 0.

I found that either title: 1 or the other rating field worked with my sample data.

e.g.

{“imdb.rating”: -1, metacritic: -1} in one facet
{metacritic: -1:, "imdb.rating: -1} in the other facet

Else use title: 1 as the secondary sort field.

Maybe it has something to do with my data in the Atlas cluster?

Hi @NeilM,

I would recommend updating the pipeline to use these sorts and then looking at individual results, before adding the $setIntersection stage in the pipeline.

The use of title field makes the sort consistent and it’s easier to find the movie by retaining this field in the pipeline.

Please feel free to reach out if you have any questions.

Kind Regards,
Sonali

Hi @Sonali_Mamgain! Could you help me out with this Lab?
I’m not able to find any match with my pipeline.

[
  {
    $facet: {
      ratings: [
        {
          $match: { "imdb.rating": { $ne: "" } }
        },

        {
          $sort: { "imdb.rating": -1 }
        },

        {
          $limit: 10
        },

        {
          $project: {
            _id: false,
            title: true,
            "imdb.rating": true
          }
        }
      ],

      metacritics: [
        {
          $match: { metacritic: { $ne: null } }
        },

        {
          $sort: { metacritic: -1 }
        },

        {
          $limit: 10
        },

        {
          $project: {
            _id: false,
            title: true,
            metacritic: true
          }
        }
      ]
    }
  },

  {
    $project: {
      commonToBoth: { $setIntersection: ["$ratings", "$metacritics"] }
    }
  }
]

I admit that the question made the problem 10x harder to be solved. I could only figure it out with the help of these posts:

There are more than 10 identical top Metacritic entries, so which top 10 is not clear. The solution used a title sort to pick the top 10. There are also duplicate names in the imdb top 10 which would require further work to get to the distinct or unique top 10. Unfortunately the distinct operator is not available in the tier we are using for the course. I managed to force a distinct by using grouping by title name, then throwing out the counts and limiting the result set to the top 10 grouped names.

There is also an ambiguous feature of the intersection operator in that from the examples it is not obvious that it can operate over comparing arrays of JSON objects with the same key.

This is a wickedly hard problem given limited APIs in Atlas for students, ambiguous directions, and data sets that have problems. I would expect this difficulty in a Kaggle competition.

I think the Lab question is clear, but my doubt is the use of “$exists” and “$gt” in the $match stage.

When sorting the movies descending by imdb.rating I get a different list using $exists
the “answer” movie goes to position 19 therefore “there are 0 movies in common”

Using $gt:0 moves things around and the answer movie moves up to number 2 in the list. Giving the correct answer for the lab.
Does that make sense?

Got it!
The culprit is how you $match your docs. If you do it separately $match for imdb.ratings only you get a list, different than $match for imdb.rating AND metacritic.
So, the question should have something like “movies that have both ratings by imdb AND metacritic”. It’s kind of implicit, but still took a while.