How to convert array elements to documents

Hi there!

I am looking for a performant method to convert array elements to documents.

  • The “fileds” delimiter in every string ( one array element) is “/”
  • The number of / (delimiters) is always seven
  • The first three “fields” represent a date (d-m-yy) day and month can have one ore two digits

Example input:

     "dataArray": [
              "27/1/17/81/1/0/0/abc",
              "27/1/17/149/2/10/0/sda",
              "1/2/17/6/2/2/1/eafd",
              "1/12/17/150/1/0/0/asdfg",
              "21/2/17/5/1/0/0/kl",
              "21/2/17/0/1/0/0/klkjjkj"]

Wanted output for the first array element:

 {
    "insertDate" : ISODate("2017-01-27T00:00:00Z"),
    "value1" : 81,
    "value1" : 1,
    "value1" : 0,
    "value1" : 0,
    "name1" : "abc"
  }

I thought of various methods but got stuck. One path was to unwind the dataArray and split the data stings to a new array. However this feels odd.

use('myDB');

const aggregation = [
{
    '$unwind': {
      'path': '$dataArray', 
      'includeArrayIndex': 'comp_dataset'
    }
  }, {
    '$project': {
      'comp_dataset': 1, 
      'comp_data': {
        '$split': [
          '$dataArray', '/'
        ]
      }, 
      '_id': 0
    }
  }}
];
db.myDB.aggregate(aggregation);

I unfortunately found no simple option to identify the substring of the first three fields like index of the 3rd occurrence of “/” to pass this to $dateFromString. The rest I would have split and $addtoset.
Normally @Asya_Kamsky 's pages like Stupid Tricks with MongoDB are a gold mine of tricks unfortunately I haven’t found a cool idea this time.

Any one who has done something like this before? I should mention that three are appr. 10 Mio documents with ~20 array elements each to process - so I stayed (until now) away from too complex constructs.

Thanks a lot :slight_smile:
Michael

Hi @Michael,

It feels like unwinding and grouping the arrays into an object is unavoidable because each one needs to form seperately.

I wonder if you tried $regexFind with grouping to match a part of the desired string and format it into a new value.

This sounds like a candidate to allow identify patterns whithin a string and then format it.

Finally a $project or $mergeObjects may morph the output to a single document.

I will need to work more to provide an actual code.

Thanks
Pavel

Hello @Pavel_Duchovny
thanks for you thoughts, I actually tried $regexFind and also $split plus $arrayElemAt passed to $dateFromString / $dateFromParts all of them seem to be a too long path to get a performing result. I am keen on new ideas to get me out of a “thinking tunnel” an review with a different perspective.
Regards,
Michael

Hi @Michael,

Refactoring $arrayElemAt and other operators into a function makes the pipeline little manageable. However, I’m not sure, if there is a way to get a substring once the delimiter count is met like you mentioned.

Here’s my attempt at the solution. There’s still plenty of room for improvement since I haven’t tested the performance aspect against millions of docs like mentioned in your use case.

Hope this helps.

getElem = function(index, year = null) {
    if (year !== null) {
        return {
            $toInt: {
                $concat: [year, {
                    $arrayElemAt: ["$dataArray", index]
                }]
            }
        }
    } else {
        return {
            $toInt: {
                $arrayElemAt: ["$dataArray", index]
            }
        }
    }
}

db.coll.aggregate([
    {
        $unwind: "$dataArray"
    },
    {
        $addFields: {
            dataArray: {
                $split: ["$dataArray", "/"]
            }
        }
    },
    {
        $project: {
            insertDate: {
                $dateFromParts: {
                    year: getElem(2, "20"), //assuming, 21st century
                    month: getElem(1),
                    day: getElem(0)
                }
            },
            value1: getElem(3),
            value2: getElem(4),
            value3: getElem(5),
            value4: getElem(6),
            name1: {
                $arrayElemAt: ["$dataArray", 7]
            }
        }
    }
])

Thanks,
Mahi

1 Like

Unwinding is unavoidable since you are going from an array element to document, but I don’t see why there would be any grouping involved.

@Michael here’s a possible start:

   db.myDB.aggregate([
             {$unwind:"$dataArray"},
             {$project:{f:{$split:["$dataArray","/"]}}},
             {$project:{date:{$dateFromParts:{year:{$add:[2000,{$toInt:{$arrayElemAt:["$f",2]}}]}, month:{$toInt:{$arrayElemAt:["$f",1]}}, day:{$toInt:{$arrayElemAt:["$f",0]}}}}}}
    ])
    { "_id" : ObjectId("6112eacebba8077232b072bd"), "date" : ISODate("2017-01-27T00:00:00Z") }
    { "_id" : ObjectId("6112eacebba8077232b072bd"), "date" : ISODate("2017-01-27T00:00:00Z") }
    { "_id" : ObjectId("6112eacebba8077232b072bd"), "date" : ISODate("2017-02-01T00:00:00Z") }
    { "_id" : ObjectId("6112eacebba8077232b072bd"), "date" : ISODate("2017-12-01T00:00:00Z") }
    { "_id" : ObjectId("6112eacebba8077232b072bd"), "date" : ISODate("2017-02-21T00:00:00Z") }
    { "_id" : ObjectId("6112eacebba8077232b072bd"), "date" : ISODate("2017-02-21T00:00:00Z") }

Now, you cannot have multiple fields named value1 in your output document but you can populate them the same way by getting the appropriate element out of the array that results from $split expressions.

Asya

1 Like

Hello @Asya_Kamsky
thanks for the starting point, I could work from there and currently optimize the resulting query since I had to cover some more surrounding conditions.
Until now I could not figure out why I get no ISODate back but a date object. I tried also with your example in an VS Playbook and got, surprisingly also a $date back.

"date": {
      "$date": "2017-02-21T00:00:00Z"
    }

I would like to get:
"date" : ISODate("2017-01-27T00:00:00Z") }

I believe this is a bug or limitation of plug-in. I don’t see this issue in the legacy shell…

1 Like

This topic was automatically closed 5 days after the last reply. New replies are no longer allowed.