How to get query output as nested document using aggregation stages

Hi All,
I am trying to write a query for nested documents but am not able to get the desired output. If someone can guide me it will be very helpful.

I have 4 collections names user, docs, pages, paragraphs
user :
{_id: "1234abcd", "numofdocs": 2, "date": "02/02/2022"}

docs:

{{"_id" : "d1", "user_id": "1234abcd", "doc_name" : "a.pdf"},
 {"_id" : "d2", "user_id": "1234abcd", "doc_name" : "b.pdf"}}

pages:

{
{"_id" : "d1p1", "doc_id" : "d1", " page_size" : [540,860]},
{"_id" : "d1p2", "doc_id" : "d1", " page_size" : [540,860]},
{"_id" : "d2p1", "doc_id" : "d2", " page_size" : [545,865]},
{"_id": "d2p2", "doc_id": "d2", " page_size" : [545,865]}
}

paragraphs:

{
{"_id" : "ap1", "page_id" : "d1p1", "text" : " hello a "},
{"_id" : "ap2", "page_id" : "d1p1", "text" : " hello b "},
{"_id" : "ap3", "page_id" : "d1p2", "text" : " hello c "},
{"_id" : "ap4", "page_id" : "d1p2", "text" : " hello d"},
{"_id" : "bp1", "page_id" : "d2p1", "text" : " hello  e"},
{"_id" : "bp2", "page_id" : "d2p1", "text" : " hello  f"},
{"_id" : "bp3", "page_id" : "d2p2", "text" : " hello g "},
{"_id" : "bp4", "page_id" : "d2p2", "text" : " hello h "}
}

and my desired output

{_id: "1234abcd", numofdocs: 2, date: "02/02/2022", 
"doc": [{"doc_name": "a.pdf", "pages": [ {"page_size": [540,860] , "paragraph": [{"text":"hello a"},{"text":"hello b"}]},
 {"page_size": [540,860] , "paragraph": [{"text":"hello c"},{"text":"hello d"}]}]},
{"doc_name" : "b.pdf", "pages": [ {"page_size": [545,865] , "paragraph": [{"text":"hello e"},{"text":"hello f"}]},
 {"page_size": [545,865] , "paragraph": [{"text":"hello g"},{"text":"hello h"}]}]}]
}

so far i have tried with 5 stages like match, lookup, unwind, lookup, unwind, lookup, project
but not getting in desired format…i am getting 8 documents in which each paragraph text is attached with page and doc information which is repeating in every document.

It always help to understand a document structure when there is a little bit of formatting. This gives the following for your desired output.

{	"_id" : "1234abcd",
	"numofdocs" : 2,
	"date" : "02/02/2022", 
	"doc": [
		{	"doc_name": "a.pdf",
			"pages" : [
				{	"page_size": [540,860] ,
					"paragraph": [{"text":"hello a"},{"text":"hello b"}]
				},
				{	"page_size": [540,860] ,
					"paragraph": [{"text":"hello c"},{"text":"hello d"}]
				}
			]
		},

		{	"doc_name" : "b.pdf",
			"pages": [
				{	"page_size": [545,865] ,
					"paragraph": [{"text":"hello e"},{"text":"hello f"}]
				},
				{	"page_size": [545,865] ,
					"paragraph": [{"text":"hello g"},{"text":"hello h"}]
				}
			]
		}
	]
}

The solution involves $lookup with pipeline: with a initial $match on user.

match_user = { $match : { _id : "1234abcd" } }

/* build the list of paragraphs for pages */

lookup_paragraphs = { $lookup : {
	from : "paragraphs" ,
	localField : "_id" ,
	foreignField : "page_id" ,
	as : "paragraph"
} }

/* build the list of pages with their paragraphs for a documents */ 

lookup_pages = { $lookup : {
	from : "pages" ,
	localField : "_id" ,
	foreignField : "doc_id" ,
	as : "pages" ,
	pipeline : [
		lookup_paragraphs
	]
} }

/* build list of docs with their pages and paragraphs for users */

lookup_docs = { $lookup : {
	from : "docs" ,
	localField : "_id" ,
	foreignField : "user_id" ,
	as : "doc" ,
	pipeline : [
		lookup_pages
	]
} }

pipeline = [ match_user , lookup_docs ]

I left out the cosmetic $project to get the exact format you want.

1 - do not store dates as strings, dates as date compared to string takes less space, are faster and provide a rich API
2 - you would be better off storing your data in your desired output rather than the normalized SQL like tables and forgo completely the hierarchical $lookup.

1 Like

Thanks, @steevej for the response and yeah for formatting also, next time I will make sure to do formatting.
The solution you are telling me will require 4 aggregation operations and it will work but I am looking for single aggregation with multiple stages, I am working in MongoDB compass, so there this solution won’t work I guess if it does then please guide me on how to do it. And thanks for the suggestion.
I have tried with one aggregation and I got it almost but not exactly the format. Here is the mongo shell extraction code.

db.getCollection("user").aggregate(
    [
        {
            "$match" : {
                "request_id" : "02e97006-9d0d-41cc-84f6-185dbacafdd5"
            }
        }, 
        {
            "$lookup" : {
                "from" : "docs",
                "localField" : "_id",
                "foreignField" : "user_id",
                "as" : "doc"
            }
        }, 
        {
            "$unwind" : {
                "path" : "$doc"
            }
        }, 
        {
            "$lookup" : {
                "from" : "pages",
                "localField" : "doc._id",
                "foreignField" : "doc_id",
                "as" : "doc.pages"
            }
        }, 
        {
            "$unwind" : {
                "path" : "$doc.pages"
            }
        }, 
        {
            "$lookup" : {
                "from" : "paragraphs",
                "localField" : "doc.pages._id",
                "foreignField" : "page_id",
                "as" : "doc.pages.paras"
            }
        }, 
        {
            "$group" : {
                "_id" : "$doc.pages.doc_id",
                "request_id" : {
                    "$first" : "$user_id"
                },
                "data" : {
                    "$push" : "$doc"
                }
            }
        }
        }
    ])

It is giving me data as a list of pages not nested according to the desired output.
Please correct me where am I wrong or needs modifications. And all these 4 collections are in single db.

It is a single aggregation pipeline with a single access to the database but I use variable to express each stage since it is usually easier to understand. What make you think you call aggregate 4 times? My structured code is equivalent to the monolithic

db.getCollection( "user" ).aggregate( [
    { $match : { "request_id" : "02e97006-9d0d-41cc-84f6-185dbacafdd5" } } ,
    { $lookup : {
        from : "docs" ,
        localField : "_id" ,
        foreignField : "user_id" ,
        as : "doc" ,
        pipeline : [
            { $lookup : {
                from : "pages" ,
                localField : "_id" ,
                foreignField : "doc_id" ,
                as : "pages" ,
                pipeline : [
                    { $lookup : {
                        from : "paragraphs" ,
                        localField : "_id" ,
                        foreignField : "page_id" ,
                        as : "paragraph"
                    } }
                ]
            } }
         ]
    } }
] ) 
1 Like

Thanks a lot @steevej it worked and got the output.

1 Like

This topic was automatically closed 5 days after the last reply. New replies are no longer allowed.