Filtering and Sorting Regular Expressions with Compound Indexes

I’ve looked around the forums at similar topics and still don’t fully understand how this works. I thought I grasped it and then Lab 2.2 threw me for a loop. After some investigation I think the piece I was missing is that RegExp() interacts with indexes differently than I had thought? Summary below:

At 2:40 of the “When you can sort with Indexes” lecture, Kirby says that “an index can be used to both filter and sort documents if the query includes equality conditions on all of the prefix keys that precede the sort keys.” For Lab 2.2, this led me to believe that the index { “address.state”: 1, “last_name”: 1, “job”: 1 } would have an in-memory sort for this query:

> db.people.find({
    "job": /^P/,
    "first_name": /^C/,
    "address.state": "Indiana"
  }).sort({ "last_name": 1 })

My thought was that, since there is no { “address.state”: 1, “job”: 1 } prefix, it would scan in-memory.
However, looking at the explain plan, it does not:

In contrast, an index that follows the order of the query { “address.state”: 1, “job”: 1, “last_name”: 1 } does do an in-memory sort:

If we run a query without RegExp() on the jobs field, however, we get our more efficient, no scan in memory with this second index:

So, what’s going on here? If the index is sorted by state, and then last name, and then job, doesn’t it need to filter the documents in that order first?

I’m trying to think of a good metaphor; if a book shelf is organized by color, author, and then title, and I take out all the blue books by Vonnegut, the titles should already be in order. But this doesn’t apply to “authors whose names start with V”, and it does apply to “books with titles that start with ‘The’”…?

Having a hard time wrapping my head around it. Any guidance would be appreciated.

I’ve reached the “Optimizing your CRUD Operations” lesson, and I can see the reasoning deals with RegExp() returning a range, so the index should typically follow the “equality, sort, range” standard to avoid an in-memory sort, but I still don’t understand how. I set up a table in Excel to help myself visualize and it doesn’t line up with this method. Let’s say I sort the table by color (red yellow blue), then last name, then first name, (i.e. the index would be { color: 1, last_name: 1, first_name: 1 }). It looks like this:

Now, if I follow the rule, color is my equality parameter, last_name is my sort, and first_name is my range. Let’s say my query, then, is to return everyone in the blue group whose first name contains the letter “o”, sorted by last name ( find({ color: “blue”, first_name: RegExp(’.o.’)}).sort({ last_name: 1 }) )

After filtering by color, I’m left with this:

Since these are in-order by first name, and not last name, won’t this require me to first re-sort my list (an in-memory sort)? Wouldn’t it be more efficient to first return the range and sort only the remaining three people?

The table you shared is not sorted by color because you have some yellow rows, a red row, a blue row, another red, 4 blue and the last one is red. To be sorted by color red yellow blue, you would have all the 3 red rows at the top then the 2 yellow and 5 blue at the bottom.

Your red rows at not sorted by last name, first name, the last name sorted order within red is nelson, pham then smith. Ditto for the blue rows, gibson, johnson, perry, turner then white is the proper last sorted order within blue.

The following

seems to be a contradiction with

So maybe it works differently in MongoDB, but if you do three sorts on the same table (or if you’re organizing a bookshelf), the sort that happens last is the one that will be most prevalent. I.e., the initial sort is partially overwritten by the latter two. If I have Bob Barker, Carl Anderson, and then Alice Anderson, during the first sort the order will be Alice, Bob, then Carl, and during the second it will be Alice, Carl, then Bob, because last name takes priority over first name as a higher order sort key. Saying the list was sorted first by first name, and then by last, and the list is thus ordered by last name now isn’t a contradiction, it’s just the consequence of sorting the entire list twice, though I can see how that sounds weird.

BUT, I think I’m getting it now. It doesn’t work like a table; it works like a tree (duh, MongoDB). Is that right? So if I had a bookshelf I would group the colors, then sort each group by author last name, then, IF any books had the same author last name, I would order them by first name. Which, in retrospect, is more how phone books were organized (if I’m remembering correctly, I haven’t seen one in an eon).

With that in mind, the ESR method makes way more sense. If I match the red group, I’ll have an ordered list of last names, which I can then filter down, and I’ll be left with an ordered list of first names. If I try to access the first names, I’m skipping a branch on the tree, creating an in-memory sort. Wow, thank you.

tl;dr It just now occurred to me that compound indexes in MongoDB being key-valued pairs means each level is embedded in the previous one, and you need to be on the correct level or your list will be out of order. Thank you @steevej for revealing this by making me explain myself, lol.

Follow-up: I think making this concept necessary to answer Lab 2.2 is confusing when it isn’t introduced until chapter 4. Guess I’ll follow-up on the lab

1 Like