Compound $search not explicitly matching array value in results

I'm having some issues with Atlas $search and compound queries against faceted array fields.

Take the following dataset:

[
  {
    "_id": "one",
    "category": "bear",
    "tags": ["202"],
    "text": "This is some text"
  },
  {
    "_id": "two",
    "category": "bear",
    "tags": ["202-L"],
    "text": "This is some text"
  },
  {
    "_id": "three",
    "category": "bear",
    "tags": ["204-L"],
    "text": "This is some text"
  },
  {
    "_id": "four",
    "category": "tiger",
    "tags": ["202"],
    "text": "This is some text"
  }
]

I’m using the above data with the following $search stage:

{
  index: "text",
  facet: {
    operator: {
      compound: {
        filter: [
          {
            text: {
              path: [
                "text",
              ],
              query: "some",
            },
          },
          {
            text: {
              path: "category",
              query: "bear",
            },
          },
          {
            text: {
              path: "tags",
              query: "202"
            },
          },
        ],
      },
    },
    facets: {
      tags: {
        type: "string",
        path: "tags",
      },
    },
  },
}

My problem is specifically with the tags field. I only want documents whose tag exactly matches “202” to be returned, but that isn’t what is happening.

{
     text: {
          path: "tags",
          query: "202" // or ["202"]
     }
}

The above returns documents “one” and “two”, but I would only expect document “one” to be returned. It seems to be matching the “202” inside the tag values and not acknowledging that “202” != “202-L”.

If I change the filter to the following:

{
     text: {
          path: "tags",
          query: "202-L"
     }
}

It returns “one”, “two”, and “three” when I would only expect it to return document “two”. It appears to be matching the “202” and the “-L” across all documents.

I’ve read through the documentation and just can’t figure out what I am missing. How can I go about only matching explicit strings in an array and not partial values?

Hey @w3e,

Welcome to the MongoDB Community Forums!

I replicated your sample documents on my end and used the following index definition:

{
  "analyzer": "lucene.whitespace",
  "searchAnalyzer": "lucene.whitespace",
  "mappings": {
    "dynamic": true
  }
}

I then used the following search query:

{
  index: 'default',
  text: {
    query: '202',
    path: 'tags'
  }
}

Only documents “one” and “four” were returned, since both have the tag “202”. Similarly, when I searched:

{
  index: 'default',
  text: {
    query: '202-L',
    path: 'tags'
  }
}

only document “two” was returned.
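
To illustrate why the analyzer choice changes the result, here is a rough sketch in plain JavaScript. It does not call Atlas. The three functions are simplified approximations of lucene.standard, lucene.whitespace, and lucene.keyword (assumptions for illustration, not Lucene's actual tokenizers), and textMatch is a hypothetical helper that mimics the text operator matching when the query and the field share any term. No category filter is applied here, so document “four” also appears.

```javascript
// Simplified stand-ins for three Atlas Search analyzers (NOT Lucene's real code).
const analyzers = {
  // Approximation of lucene.standard: lowercase, split on non-alphanumerics.
  standard: (s) => s.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean),
  // Approximation of lucene.whitespace: split on whitespace only, keep case.
  whitespace: (s) => s.split(/\s+/).filter(Boolean),
  // Approximation of lucene.keyword: the whole string is one token.
  keyword: (s) => [s],
};

// Sample documents from the question (only the fields needed here).
const docs = [
  { _id: "one", category: "bear", tags: ["202"] },
  { _id: "two", category: "bear", tags: ["202-L"] },
  { _id: "three", category: "bear", tags: ["204-L"] },
  { _id: "four", category: "tiger", tags: ["202"] },
];

// Hypothetical helper: a document matches when ANY query term
// equals ANY term produced from one of its tags.
function textMatch(analyzerName, query) {
  const analyze = analyzers[analyzerName];
  const queryTerms = new Set(analyze(query));
  return docs
    .filter((d) =>
      d.tags.some((tag) => analyze(tag).some((term) => queryTerms.has(term)))
    )
    .map((d) => d._id);
}

console.log(textMatch("standard", "202-L"));   // [ 'one', 'two', 'three', 'four' ]
console.log(textMatch("whitespace", "202-L")); // [ 'two' ]
console.log(textMatch("whitespace", "202"));   // [ 'one', 'four' ]
```

Under the standard-style split, “202-L” becomes the terms “202” and “l”, so a lone “l” is enough to pull in “204-L”. With whitespace or keyword tokenization the tag stays in one piece.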

Hope this helps. If not, it would be good if you can share your index definition as well.

Regards,
Satyam

@Satyam,

First off, thank you for taking the time to give this a look.

I think what you have shared exhibits the issue perfectly. I’m using the default analyzer, but I did try the whitespace and keyword analyzers and saw similar behavior.

If I am dealing with tags (or a taxonomy), I would not be interested in substrings of each array string.

If I am querying “202” I would only expect/want results that match the full “202” string. I’m not interested in items tagged with “202-L” or “2023”.

Likewise, if I query for “202-L” I would not expect “215-L” or something different (I was seeing that the search was treating the “L” as its own word, returning erroneous results.)

With non-$search aggregation stages, it does not behave this way. If I use the following stage:

{
    $match: {
        tags: "202"
    }
}

I only get documents that match “202”, not “202-L” or “2023”.

I understand this behavior for standard text searches, but if we are dealing with any type of facet, the results are flawed. This is because we need to look at the field’s full array item and not a substring of that item.

Hey @w3e,

Yes, I would advise you to play around and read the documentation on different search index types available as well as the various operators that Atlas offers to see which one would serve your use case best. Based on what you described, using lucene.whitespace might be of more use to you than the default lucene.standard.

Trying out different approaches is the best way to learn and should help you a lot. I suggest reading up more on the other search operators as well as the analyzers, and then deciding which one would be ideal for your use case.
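
As one possible starting point for exact tag matching, the sketch below indexes tags as both token and stringFacet and filters with the equals operator instead of text. This is an assumption on my part rather than something tested against your cluster: it presumes your Atlas Search version supports the token field type and string values in equals, so please check the current documentation first. The names indexDefinition and buildTagFacetSearch are hypothetical, and the index name "text" is taken from your query.

```javascript
// Hypothetical index definition (an assumption, verify against the docs):
// "tags" is indexed as token for whole-value matching and as stringFacet
// so it can still be used in string facets.
const indexDefinition = {
  mappings: {
    dynamic: true,
    fields: {
      tags: [{ type: "token" }, { type: "stringFacet" }],
    },
  },
};

// Hypothetical helper: builds the body of the facet search stage,
// comparing the tag as a whole value rather than as analyzed terms.
function buildTagFacetSearch(tag) {
  return {
    index: "text",
    facet: {
      operator: {
        compound: {
          filter: [
            { text: { path: "text", query: "some" } },
            { text: { path: "category", query: "bear" } },
            // equals (assumed to accept strings on token fields) replaces text here.
            { equals: { path: "tags", value: tag } },
          ],
        },
      },
      facets: {
        tags: { type: "string", path: "tags" },
      },
    },
  };
}

console.log(JSON.stringify(buildTagFacetSearch("202"), null, 2));
```

If this works in your environment, a query for “202” should no longer be split into separate terms, because a token field is not run through a tokenizing analyzer.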

Regards,
Satyam