Return Specific 'Capture' Array Element from $regexFind

Hello everyone,

I’m using $regexFind with regex groups to match and return a specific value buried in a text string, this is a couple of example data objects:

  {
    "key": 1,
    "my_val": "me",
    "Attrib1": "sdf dfg me 1 dfg",
    "Attrib2": "sdf dfg we 2 dfg",
    "Attrib3": "sdf dfg pe 3 dfg"
  },
  {
    "key": 2,
    "my_val": "we",
    "Attrib1": "sdf dfg me 4 dfg",
    "Attrib2": "sdf dfg we 5 dfg",
    "Attrib3": "sdf dfg we 6 dfg"
  }
]

The objective is to find all the values/integers after " me " in the ‘Attrib’ strings. I have done this but it requires two steps to pull the single captured value:

  {
    "$addFields": {
      "parts": {
        "$regexFind": {
          "input": "$Attrib1",
          "regex": {
            $concat: [
              "(",
              "$my_val",
              ") ",
              "([0-9]+) "
            ]
          }
        }
      }
    }
  },
  {
    "$addFields": {
      "part1": {
        "$arrayElemAt": [
          "$parts.captures",
          1
        ]
      }
    }
  }
])

This works, I get a result for the first object as expected in the added “part1” field but I want to check all the ‘attrib’ strings so it would be helpful to be able to return ONLY the required captured group as either a string or a single element array.

I have tried various syntax to access elements of the returned object (from $regexFind) but none work. Is there a way to achieve the equivalent of this (which I know does NOT work and I shall use add $addFields but to make a point… and hence the question :smiley: ):

{ "$regexFind": {
          "input": "$Attrib1",
          "regex": {
            $concat: [
              "(",
              "$my_val",
              ") ",
              "([0-9]+) "
            ]
          }
        }
      }.captures[1]

Hopefully this makes sense and I have used the back-ticks correctly, any ideals welcome and please don’t dismiss the possibility that I’m missing something obvious/basic/simple! Thank you.

Hi @Chunky_Plumpy ,

Welcome back to MongoDB community

I am not sure I fully understand the requirements.

You can use a $project in the second stage to only output a single field and omit _id instead of add fields .

MongoDB aggregation return document cursor or document arrays it cannot return single values. Your client code can transform that.

Maybe a document example and desired output will help…

Thanks
Pavel

Hello Pavel,

Thanks for chiming in, and I suspect my question is a little confusing as I think my issue is more basic than I make it sound.

Simply put, I want to nest the $regexFind inside the $arrayElemAt inside $addFields rather than use them sequentially as I have done in the pipeline. The reason is purely because my objects actually have 100 string attributes on which I must check (using regex) their content to lift specific information. If the pipeline contains two or three steps for each attribute/check then the final aggregation query is going to be HUGE. Compacting the query by nesting might make it a little more manageable… maybe.

So I did achieve the desired outcome in two steps in the original post but was trying to do something like this:

db.collection.aggregate([
  {
    "$addFields": {
      "part1": {
        "$arrayElemAt": [
          
		  {
			"$regexFind": {
				"input": "$Attrib1",
				"regex": {
					$concat: [
					"(",
					"$my_val",
					") ",
					"([0-9]+) "
					]
				}
			}
		  }.captures,
          1
        ]
      }
    }
  }
])

Note the “.captures” which I simply tagged on the end of the embedded $regexFind section as if it was an object. This is invalid syntax but I hope explains what I’m trying to do?

Cheers,
Martin.

Hi @Chunky_Plumpy ,

Sorry I am still confused by your explanation.

Perhaps an example document will be at help.

If the reason you might have 100 stages is because you have 100 fields representing a 100 attributes you might benefit from using a different data model like the attribute pattern:

With this pattern all attributes are nested in attributes.value fields and therefore one stage can match them or you can $unwind and get each attribute as a sperate document

I can not offer other as advice as I don’t fully understand the problem.

Thanks
Pavel

Hello @Pavel_Duchovny ,

Thanks for hanging in here with me, I can explain. The data is extracted from an old telephony device and sadly we maintain the database structure as it is required when pushing the same data back to the device; for the moment I’m stuck with it.

The data structure should paste into playground, it is basically the one at the start of the thread but this time with slightly more realistic content (there are really 100 vector steps plus 30 other attributes):

db={
  "Vector": [
    {
      "vectorId": "10",
      "find_Me": "skill",
      "vectorStep1": "wait-time 10 secs ringback",
      "vectorStep2": "goto step 1 unconditionally",
      "vectorStep3": "goto step 2 conditionally"
    },
    {
      "vectorId": "20",
      "find_Me": "skill",
      "vectorStep1": "wait-time 11 secs ringback",
      "vectorStep2": "goto step 4 unconditionally",
      "vectorStep3": "queue-to skill 50 m"
    }
  ]
}

I would very much like to achieve the creation of a field in my result object which is the VALUE after the string match “skill”, however I can only build syntax to return a whole OBJECT from the $regexFind function… I want only a component of the returned object. So this is the working query that extracts the OBJECT (it runs):

db.Vector.aggregate([
  {
    "$addFields": {
      "parts": {
        "$regexFind": {
          "input": "$vectorStep3",
          "regex": {
            $concat: [
              "(",
              "$find_Me",
              ") ",
              "([0-9]+) "
            ]
          }
        }
      }
    }
  }
])

The result is actually two objects but only one matches as expected, giving the following result for the object where we get a match (I am getting this result):

{
    "_id": ObjectId("5a934e000102030405000001"),
    "find_Me": "skill",
    "parts": {
      "captures": [
        "skill",
        "50"
      ],
      "idx": 9,
      "match": "skill 50 "
    },
    "vectorId": "20",
    "vectorStep1": "wait-time 11 secs ringback",
    "vectorStep2": "goto step 4 unconditionally",
    "vectorStep3": "queue-to skill 50 m"
  }

The problem is I have grabbed the whole OBJECT returned by $regexFind and assigned this to the new field called “parts”. But I only want the numeric string value which is in the “captures” array inside it. So the result I WANT form a single command is (I really want this):

{
    "_id": ObjectId("5a934e000102030405000001"),
    "find_Me": "skill",
    "parts": "50",
    "vectorId": "20",
    "vectorStep1": "wait-time 11 secs ringback",
    "vectorStep2": "goto step 4 unconditionally",
    "vectorStep3": "queue-to skill 50 m"
  }

To achieve this I would need some sort of syntax to reference the field “captures” in the OBJECT being returned from the $regexFind function and then index into the ARRAY element of it to get to the required entry (1) when assigning to the new field.

Is this possible in a SINGLE pipeline step, can I execute the $regexFind function and extract only a single field, or better still a single element of the “captures” array within it and achieve my goal of assigning ONLY the value “50” that appears in my string ??

Cheers,
Martin.

Ok @Chunky_Plumpy ,

Hopefully now I understand it better.

So for just projecting the 50 value you could probably do the following:

db.Vector.aggregate([
  {
    "$addFields": {
      "parts": {
        "$regexFind": {
          "input": "$vectorStep3",
          "regex": {
            $concat: [
              "(",
              "$find_Me",
              ") ",
              "([0-9]+) "
            ]
          }
        }
      }
    }
  },
  {
    "$addFields": {
      "parts": {
        "$arrayElemAt": [
          "$parts.captures",
          1
        ]
      }
    }
  }
])

This overwrite “parts” with the second array element:

[
  {
    "_id": ObjectId("5a934e000102030405000000"),
    "find_Me": "skill",
    "parts": null,
    "vectorId": "10",
    "vectorStep1": "wait-time 10 secs ringback",
    "vectorStep2": "goto step 1 unconditionally",
    "vectorStep3": "goto step 2 conditionally"
  },
  {
    "_id": ObjectId("5a934e000102030405000001"),
    "find_Me": "skill",
    "parts": "50",
    "vectorId": "20",
    "vectorStep1": "wait-time 11 secs ringback",
    "vectorStep2": "goto step 4 unconditionally",
    "vectorStep3": "queue-to skill 50 m"
  }
]

However, if you need to search dynamically fields it might be problematic without 100 stages+. We have a Altas search with Atlas instances that can index dynamically fields and mayeb the regex operator there can be off some help but not sure…

Best regards,
Pavel