MongoDB $lookup operator based on key name

I have two collections, tables and table_rows, like the following:

tables = {
  "_id": "641ce65852a7ccd2f4a7b298",
  "name": "table name",
  "description": "table description",
  "columns": [{
    "_id": "641ce65852a7ccd2f4a7b299",
    "name": "column 1",
    "dataType": "String"
  }, {
    "_id": "641cf95543a5f258bfaf69e3",
    "name": "column 2",
    "dataType": "Number"
  }]
}

table_rows = {
  "tableId": "641ce65852a7ccd2f4a7b298",
  "641ce65852a7ccd2f4a7b299": "Example string",
  "641cf95543a5f258bfaf69e3": 101
}

This, in a nutshell, represents regular tabular data of 2 columns and 1 row:

column 1       | column 2
Example string | 101

Is there any way to use the $lookup operator to join those two collections based on the _id from the columns array in the tables collection and on the key name in the table_rows collection? What I'm trying to do is somehow join the column definitions (name, dataType, etc.) with the cell values.

As you can see, the actual key name in the table_rows collection is the _id of the column itself.

Ideally this would be a single collection, but these tables can grow to hundreds of columns and 10K rows, so it is modeled as two collections to avoid unbounded arrays in MongoDB.

Using a data value (like a column's _id) as a field name (as in your table_rows) is usually a bad idea. With the attribute pattern, the table_rows collection avoids using dynamic values as field names:

table_rows = {
  "tableId": "641ce65852a7ccd2f4a7b298",
  "columns": [
      { "column_id" : "641ce65852a7ccd2f4a7b299", "value": "Example string" } ,
      { "column_id" : "641cf95543a5f258bfaf69e3", "value" : 101 }
  ]
}
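With the attribute pattern, a conventional $lookup can do the join. A minimal sketch, assuming the collection and field names from the examples above (tables, table_rows, column_id):

```javascript
// Sketch: join attribute-pattern rows to their table definition, then pair
// each cell with its column metadata. Names are assumed from the examples.
const pipeline = [
  { $lookup: {
      from: "tables",          // collection holding the column definitions
      localField: "tableId",
      foreignField: "_id",
      as: "tableDef"
  } },
  { $unwind: "$tableDef" },
  // For each cell, pull the matching column definition by _id.
  { $addFields: {
      columns: {
        $map: {
          input: "$columns",
          as: "cell",
          in: {
            value: "$$cell.value",
            column: {
              $arrayElemAt: [
                { $filter: {
                    input: "$tableDef.columns",
                    cond: { $eq: ["$$this._id", "$$cell.column_id"] }
                } },
                0
              ]
            }
          }
        }
      }
  } }
];
// In mongosh: db.table_rows.aggregate(pipeline)
```

The $filter compares each cell's column_id against the column's _id, and $arrayElemAt picks the single matching definition.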

But then, why not completely remove the columns' _id and use a plain array for your table_rows? With a direct 1-to-1 mapping between column definitions and values, the data would be much simpler, much faster, and much smaller.

Like:

tables = {
  "_id": "641ce65852a7ccd2f4a7b298",
  "name": "table name",
  "description": "table description",
  "columns": [{
    "name": "column 1",
    "dataType": "String"
  }, {
    "name": "column 2",
    "dataType": "Number"
  }]
}

table_rows = {
  "tableId": "641ce65852a7ccd2f4a7b298",
  "columns" : [
      "Example string",
      101
  ]
}
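With this positional design, the i-th value lines up with the i-th column definition, so pairing them needs no _id matching at all. A minimal sketch of that 1-to-1 mapping in plain JavaScript (names are taken from the examples above; inside an aggregation, the same pairing could be done with { $zip: { inputs: ["$tableDef.columns", "$columns"] } } after a $lookup):

```javascript
// Sketch: positional pairing of column definitions with row values.
const columnDefs = [
  { name: "column 1", dataType: "String" },
  { name: "column 2", dataType: "Number" },
];
const rowValues = ["Example string", 101];

// Client-side equivalent of $zip: the i-th def pairs with the i-th value.
const cells = columnDefs.map((def, i) => ({ ...def, value: rowValues[i] }));
// cells[0] → { name: "column 1", dataType: "String", value: "Example string" }
// cells[1] → { name: "column 2", dataType: "Number", value: 101 }
```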

Are you going to access a table without its table_rows?

Are you going to access a table_row without its table?

If you are going to $lookup all the rows of a table from table_rows whenever you access the table, you still risk hitting the maximum document size.


Thank you very much @steevej. You don't know how much I appreciate your answer. Not being that experienced with MongoDB, I have been trying for quite some time to find a good-enough solution that would cover the standard operations a user can do with a regular spreadsheet or table of data, including the ability to support:

  • adding/removing/renaming columns
  • sorting per column
  • reordering of columns
  • large tables (e.g. 50 columns, 10 thousand rows)

I have tried to find answers and shared my thoughts on this forum in the following thread about how this could be designed in MongoDB: Database design for tabular data (user defined columns with potentially lof of rows of data to maintain) - #3 by Satyam

Just to answer your questions:

Are you going to access a table without its table_rows? => rarely, if ever
Are you going to access a table_row without its table? => rarely, if ever

I'm totally aware that data that is queried together should live in the same collection, and that was my first attempt, as explained in the question referenced above. My concern with that approach is that if everything is in the same collection, I will more or less hit MongoDB's unbounded-array problem, correct?

The reasoning behind the solution I proposed here, where the dynamic key in the table_rows collection is actually the columnId, is that I need the ability to sort, filter, and run other aggregate operations on the rows.

Now, looking at your proposed solution, it looks much better.
In your opinion, would operations like grouping, sorting, etc. work on elements of the array, and does this approach come with any drawbacks?

Again, really really thank you for your time and for your answer.

I saw the other thread, and Satyam's answer was more than appropriate.

For the record, I am an independent consultant. I do not design and code for free. I help, I point to resources, I bring up questions to think about, and I provide solutions to tricky aggregations. But I will not design and code for free. Why would my customers contract me if they can come here and have the work done for free?

I am a little bit surprised by your answer. I had absolutely nothing in mind other than having a discussion about two approaches and seeking other opinions, nothing more, which I guess a community like this is all about. Anyway, as I said earlier, thank you very much for your time and your opinion; it helped a lot. Really appreciated. Kind regards

The point of my earlier questions was that if you access everything together most of the time, you still risk having a huge array. A $lookup simply builds an array of documents, so if the rows are too big to fit in a single document, they will also be too big after a $lookup.

But there is nothing wrong with limiting the number of columns and rows. I am sure Excel and LibreOffice have such limits too.

Sorting and filtering on documents is probably more efficient, since you might be able to define indexes, while sorting and filtering inside an array will always be an in-memory operation.
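As an illustration of that difference, a sketch (collection, index, and field names are assumptions, borrowed from the discussion in this thread):

```javascript
// Sketch: a field at a fixed path can back an index, so sorting documents
// can use it; sorting *inside* an array always happens in memory, per document.

// In mongosh, against a live deployment:
//   db.rows.createIndex({ tenant_id: 1, table_id: 1, "cells.0.value": 1 })
//   db.rows.find({ tenant_id: 1, table_id: 1 }).sort({ "cells.0.value": 1 })

// In-memory array sort ($sortArray, MongoDB 5.2+); no index can help here:
const sortCellsStage = {
  $set: { cells: { $sortArray: { input: "$cells", sortBy: { value: 1 } } } }
};
// db.rows.aggregate([{ $match: { tenant_id: 1, table_id: 1 } }, sortCellsStage])
```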

Thank you @steevej once again really, really appreciated.

In the end based on some testing and different approaches and based on some very good inputs by you and similar threads, i will go with two collections. One to keep data about tables and column definitions, another one to keep rows of the tables. Some sort of attribute pattern as you mentioned:

tables = [{
  _id: 1,
  tenantId: 1,
  name: "Parking spots",
  description: "List of our parking spots",
  columns: [{
    _id: 2,
    columnName: "parkingType",
    order: 1
  }, {
    _id: 3,
    columnName: "numberOfSpots",
    order: 2
  }]
}]

rows = [
  {
    "_id": ...,
    "table_id": 1,
    "tenant_id": 1,
    "cells": [
      {
        "_id": ...,
        "columnId": 2,
        "value": "Garage",
        // other possible fields about cell itself
      },
      {
         "_id": ...,
         "columnId": 3,
         "value": 1000
      }
    ]
  },
  ...
]

With a compound index on { table_id, tenant_id } on rows, I have found that queries like sorting perform reasonably well:

db.rows.aggregate([
  { $match: { tenant_id: 1, table_id: 1 } },
  { $sort: { "cells.0.value": 1 } }
]);
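For reference, a sketch of that compound index (field names match the pipeline above; extending it with the sort key might also let the $sort use the index, subject to the usual multikey-index caveats):

```javascript
// Sketch: compound index supporting the $match above. Extending it with
// "cells.0.value" is a hypothetical option for index-assisted sorting.
const indexSpec = { tenant_id: 1, table_id: 1 };
// In mongosh: db.rows.createIndex(indexSpec)
```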

Once again really thank you for your time and answer.
