How to perform an operation on a field and then compare this result with provided data?

Gwen · June 30, 2023, 7:29pm

Hello,

I’m a new guy in the NoSQL world and I need to migrate an existing project on PostgreSQL to Mongodb.

I’m struggling with one specific query which is really not complicated on SQL side but I cannot manage to do the equivalent in noSql.

Could you help me please ?

The SQL query that I need to “adapt” to NoSql world

select * from transactions t where substring(t.voucherNumber,1,16) in (:voucherList)

As you can see, I store voucherNumbers in 30 characters in database but for a specific workflow, I received a list of vouchers on 16 characters and I need to find all the documents where the voucherNumber starts with one of the list.

It tried regex, clause IN etc, but I cannot manage to do it.

Any help would be really appreciated.

John_Sewell · June 30, 2023, 10:11pm

How about this:

db.getCollection("Test").aggregate([
{
    $match:{
        $expr:{
            $in:[
                {
                    $substrCP:['$voucherNumber', 0, 3]
                },
                ['ABC', 'ABF']
            ]
        }
    }
}
])

Gwen · July 1, 2023, 7:29pm

Thank you so much !!

turivishal · July 2, 2023, 10:13am

Hello @Gwen, Welcome to the MongoDB Community Forum,

You can use $regex operator with pipe | sign for or condition and ^ sign to check matching string from the start of the string,

db.collection.aggregate([
  {
    $match: {
      voucherNumber: {
        $regex: "^ABC|^ABF"
      }
    }
  }
])

@John_Sewell, if there is a regex operator then no need to use an extra expression operators to make this happen.

And the other side, There are some restrictions in $expr operator about index use, $expr can’t use an index with $in operator.

John_Sewell · July 2, 2023, 10:23am

True, in this case if it’s just the prefix we’re searching for a straight regex does make it easier, although you’ll need to form the regex string in code (which is trivial).

Be interesting to see the performance difference between the two though.

John_Sewell · July 2, 2023, 10:42am

I just created a quick collection with about 100,000 random voucher codes and ran the two approaches 10 times or so each.
Both come down to sub 200ms execution time (running on local workstation with lots of CPU and RAM, plus only one field in documents so the index fully covers the query in RAM)

So either way should be good, if you want to compare two fields against each other you’ll want to use the $expr but in this case as Vishal said you can just form a regex.

Note that unless you are using Atlas search indexes and doing something interesting, you need to anchor the start of a regex in order for Mongo to make use of an index when matching against it.

Gwen · July 2, 2023, 6:02pm

Thank you both for your replies.

I’m not sure to really understand what I should do concerning “Note that unless you are using Atlas search indexes and doing something interesting, you need to anchor the start of a regex in order for Mongo to make use of an index when matching against it.” ?

Gwen · July 2, 2023, 6:31pm

I don’t really have a preference between the two solutions.

The first solution of John is working well, and the one of Vishal works too (indeed the regex is trivial to build) and apparently there are no difference on performance side.

Gwen · July 2, 2023, 7:07pm

And I have another question concerning the same need : the provided list can contain a lot of vouchers to compare, something like 300 vouchers.

May be it can help to choose the solution between the two you proposed ?

turivishal · July 3, 2023, 3:57pm

Hi @Gwen,

How did you measure the performance? I mean if you are checking just taken time is less then it does not mean it performed well, understand the explain() method to check the exact view of the analytics of your query.

For more details. refer to the Performance best practices,

If you want to use the index in voucherNumber then I have already mentioned that in my previous post $expr with $in does not use an index it will do a collection scan, which means it will impact First Disk Speed, especially on a large collection, Second The size of your WiredTiger cache and the size of your overall RAM.

On the other side, you can refer to the documentation about indexing strategies

John_Sewell · July 3, 2023, 7:02pm

Vishal is correct, I didn’t notice that from prior use, running on a test collection, while fast on a large dataset it IS performing a colscan using $in embedded within the Expr. I’ve previously used this with other operators which WILL hit an index and so not noticed this restriction.

So I agree, in this scenario don’t embed the $in within the expr and just use a regex, and something to watch out for, until they add this functionality!

John