What is the best way to query an array of sub-documents in MongoDB?

Let’s say I have a collection with documents like this:

 {
    "id": "2902-48239-42389-83294",
    "data": {
        "location": [
            {
                "country": "Italy",
                "city": "Rome"
            }
        ],
        "time": [
            {
                "timestamp": "1626298659",
                "data":"2020-12-24 09:42:30"
            }
        ],
        "details": [
            {
                "timestamp": "1626298659",
                "data": {
                    "url": "https://example.com",
                    "name": "John Doe",
                    "email": "john@doe.com"    
                }
            },
            {
                "timestamp": "1626298652",
                "data": {
                    "url": "https://www.myexample.com",
                    "name": "John Doe",
                    "email": "doe@john.com"    
                }
            },
            {
                "timestamp": "1626298652",
                "data": {
                    "url": "http://example.com/sub/directory",
                    "name": "John Doe",
                    "email": "doe@johnson.com"    
                }
            }
        ]
    }
}

Now the main focus is on the array of sub-documents (“data.details”). I want the output to contain only the relevant matches, e.g.:

db.info.find({"data.details.data.url": "example.com"})
  1. How can I match every “data.details.data.url” that contains “example.com” without also matching “myexample.com”? When I use $regex I get too many results: a query for “example.com” also returns “myexample.com”.

  2. Even when I do get partial results (with $match), it is very slow. I tried these aggregation stages:

    { $unwind: "$data.details" },
    
    {
      $match: {
        "data.details.data.url": /.*example.com.*/,
      },
    },
    {
      $project: {
        id: 1,
        "data.details.data.url": 1,
        "data.details.data.email": 1,
      },
    },
    

3. I really don't understand the pattern: with $match, Mongo sometimes recognizes prefixes like "https://" or "https://www." and sometimes does not.

More info: my collection is dozens of GB in size, and I created two indexes:

Compound index: "data.details.data.url": 1, "data.details.data.email": 1
Text index: "data.details.data.url": "text", "data.details.data.email": "text"

This improved query performance, but not enough, and I still have the issue with $match vs. $regex. Thanks to anyone who can help!

Hello @Alon_Gal, welcome to the MongoDB Community forum!

You can query using this aggregation:

db.collection.aggregate([
  { 
      $project: { 
          "data.details": { 
              $filter: { 
                  input: "$data.details", 
                  as: "e", 
                  cond: { $regexMatch: { input: "$$e.data.url" , regex: "/example" } }
              }
          }
      }
  }
])

This returns only the elements of the "data.details" array whose "data.url" contains "/example" (two elements match in your sample document).
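
To see which elements that $filter condition keeps, the same test can be approximated in plain JavaScript outside the database. This is only a sketch: `details` is a trimmed copy of the sample array from the question, and `Array.prototype.filter` with a JavaScript regex stands in for `$filter` and `$regexMatch`.

```javascript
// Trimmed copy of the "data.details" array from the sample document.
const details = [
  { data: { url: "https://example.com", email: "john@doe.com" } },
  { data: { url: "https://www.myexample.com", email: "doe@john.com" } },
  { data: { url: "http://example.com/sub/directory", email: "doe@johnson.com" } },
];

// Same pattern as in the $regexMatch above: the literal text "/example".
const pattern = /\/example/;

// Array.prototype.filter stands in for $filter here (an approximation).
const kept = details.filter((e) => pattern.test(e.data.url));

console.log(kept.map((e) => e.data.url));
// [ 'https://example.com', 'http://example.com/sub/directory' ]
```

The "myexample.com" element is dropped because its URL has no "/" directly before "example".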

You are searching for a substring within a value using a regex match, and regex matches can make only limited use of an index. From the documentation topic "Regex and Index Use":

For case sensitive regular expression queries, if an index exists for the field, then MongoDB matches the regular expression against the values in the index, which can be faster than a collection scan. Further optimization can occur if the regular expression is a “prefix expression”, which means that all potential matches start with the same string.

You can also run the explain method on the query to generate a query plan and see whether the indexes are used.
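
As a rough sketch of how to read such a plan: for a find() query, the explain output reports the chosen plan as a tree of stages, where IXSCAN indicates an index scan and COLLSCAN a full collection scan. The helper below (`collectStages` is a hypothetical name, not part of MongoDB) walks a tree of that shape. The `samplePlan` object is hand-written for illustration and is not real output; the exact shape of the explain document varies by server version and is different for aggregation pipelines.

```javascript
// Hypothetical helper: collect the stage names from a winningPlan-style tree.
function collectStages(plan, out = []) {
  if (!plan || typeof plan !== "object") return out;
  if (plan.stage) out.push(plan.stage);
  if (plan.inputStage) collectStages(plan.inputStage, out);
  for (const child of plan.inputStages || []) collectStages(child, out);
  return out;
}

// Hand-written stand-in for explain().queryPlanner.winningPlan (illustrative only).
const samplePlan = {
  stage: "FETCH",
  inputStage: { stage: "IXSCAN", indexName: "illustrative_index_name" },
};

const stages = collectStages(samplePlan);
console.log(stages);                      // [ 'FETCH', 'IXSCAN' ]
console.log(stages.includes("COLLSCAN")); // false
```

In a mongosh session, the same helper could be pointed at the plan returned by something like db.info.find({ ... }).explain().queryPlanner.winningPlan, assuming your server version exposes the plan in that form.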

Funny thing about regular expressions - they match exactly what you tell them to match.

How would it know not to return myexample.com? How do you know it? I bet it’s because you see the "." or "/" before it, so what you really want to match is .*\.example\.com.* or maybe .*\/example\.com.* (without the "\", a "." matches any character). By the way, the .* before and after the regular expression is implicit, so I think what you actually want is something like:

 /[\.\/]example\.com/

or if you only want strings that end like that, then:

/[\.\/]example\.com$/
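
A quick way to check these patterns against the sample URLs from the question is to run them in plain JavaScript, whose regex syntax behaves the same as MongoDB's for simple cases like these. The variable names below are only for illustration.

```javascript
// The three URLs from the sample document.
const urls = [
  "https://example.com",
  "https://www.myexample.com",
  "http://example.com/sub/directory",
];

const loose = /.*example.com.*/;         // pattern from the question
const bounded = /[.\/]example\.com/;     // requires "." or "/" right before "example.com"
const boundedEnd = /[.\/]example\.com$/; // same, but the string must end there

// Return the URLs that a given pattern matches.
const matches = (re) => urls.filter((u) => re.test(u));

console.log(matches(loose));      // all three URLs, including myexample.com
console.log(matches(bounded));    // [ 'https://example.com', 'http://example.com/sub/directory' ]
console.log(matches(boundedEnd)); // [ 'https://example.com' ]
```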

As for the performance of your query, substring search is going to be slow because even with an index, the query has to check the entire length of each string. If you want fast query results, consider storing the relevant part of the URL (like the domain) in its own separate field that you can query with simple equality; then indexes will be used efficiently.
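
One possible way to derive such a field, sketched in JavaScript: parse each URL with the standard URL class and store its hostname alongside the original value. The field name `domain` and the helper `withDomain` are assumptions for illustration, and whether to strip a leading "www." depends on how you want to query.

```javascript
// Hypothetical helper: return a copy of a details element with an
// extra "domain" field derived from its URL.
function withDomain(detail) {
  const host = new URL(detail.data.url).hostname;
  const domain = host.replace(/^www\./, ""); // assumption: treat "www." as noise
  return { ...detail, data: { ...detail.data, domain } };
}

// Trimmed copy of the sample "data.details" array.
const enriched = [
  { data: { url: "https://example.com" } },
  { data: { url: "https://www.myexample.com" } },
  { data: { url: "http://example.com/sub/directory" } },
].map(withDomain);

console.log(enriched.map((e) => e.data.domain));
// [ 'example.com', 'myexample.com', 'example.com' ]
```

With the values stored this way, an equality condition such as { "data.details.data.domain": "example.com" } no longer needs a regex, and an ordinary index on that field can serve it.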

Asya

Thanks, that appears to have solved it!