MongoDB in Big Data Search

I encountered a problem at work, but because I am a self-taught programmer with no professional training, I have no way to judge whether there is a better solution, so I am asking here.

System limitations:

  1. No cloud access is possible; the data is all company confidential and cannot be placed in the cloud
  2. Two machines are used for hardware redundancy, and one of them serves as a Replica Set backup, so the system cannot scale horizontally
  3. The maximum data capacity is 144TB, and the data will never exceed this number (newer data overwrites the oldest)
  4. The current storage architecture uses a single Database, with the data split across different Collections
  5. Normal usage only extracts the current day's or recent data and hands it to the front end for charting or display; no data is ever modified
  6. Occasionally up to one year of data is extracted and exported to Excel
  7. Indexes are already in use, aggregate is used to limit searches, and operators such as $in/$nin/$regex, which logically scan the entire Collection, are used sparingly
  8. The cache size has been configured to 50% of system memory
  9. The system uses SSD storage, and the network is already a 10G network

Problems encountered:
At present, MongoDB runs normally and performs very well while traffic is low. The data volume may grow to the 144TB limit within 3-5 years. My concern is whether searches will then hit MongoDB timeouts or become inefficient.

Question:
Is there a way to improve search efficiency?
For example: putting all birthday records into a dedicated birthday DB, and querying only that DB when I need to look up birthdays.

Or would using a Cluster/Sharding improve efficiency?

Thank you very much for your answers and help. I will keep every answer in mind. Best wishes.

Hey @William_Lyu,

Welcome to the MongoDB Community Forums! :leaves:

Can you please explain this further? For example, how are you measuring the slowness in performance, and is a timeout actually happening? This is not very clear at the moment.

There are a few ways to improve search based on what you described such as:

  • Index Optimization: You mentioned that you are already using indexes, but slow queries are typically ones that do not use an appropriate index. It is important to ensure that appropriate indexes exist on the queried fields. The MongoDB explain() method can be used to see whether an index is being used for a particular query and to understand the query's performance.
  • Sharding: Sharding can help distribute data across multiple machines and improve query performance by allowing parallel execution of queries.
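As a small illustration of the explain() workflow mentioned above (a sketch only; the stage names follow MongoDB's explain output, while the commented connection details and query are placeholder assumptions, not from this thread), a helper can walk the winningPlan and report whether the query used an index scan (IXSCAN) rather than a full collection scan (COLLSCAN):

```python
def plan_stages(plan):
    """Recursively collect stage names from an explain() winningPlan."""
    stages = [plan["stage"]]
    if "inputStage" in plan:
        stages += plan_stages(plan["inputStage"])
    for child in plan.get("inputStages", []):  # e.g. OR stages have several children
        stages += plan_stages(child)
    return stages

def uses_index(explain_doc):
    """True if the winning plan contains an index scan (IXSCAN) stage."""
    plan = explain_doc["queryPlanner"]["winningPlan"]
    return "IXSCAN" in plan_stages(plan)

# Usage against a live deployment (PyMongo; names are assumptions):
# from pymongo import MongoClient
# col = MongoClient("mongodb://localhost:27017")["testDB"]["testCollection"]
# explain_doc = col.find({"time": {"$gte": some_date}}).explain()
# print(uses_index(explain_doc))  # False means a collection scan: add/fix an index
```

If uses_index() returns False, the query is scanning the whole collection and will degrade as the data grows.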

Please note that the above points are suggestions only; the specific solution, or combination of solutions, will depend on a variety of factors, including the types of queries being executed, the data model, and the hardware and infrastructure available. It would help if you could share your current schema design, some sample documents, and the queries you are executing (since you mentioned aggregation with the $in, $nin, and $regex operators, which can also impact performance), along with your indexes, the output of explain("executionStats"), and the expected outputs, so we can pinpoint exact solutions.
If this is not a possibility, I’m attaching some documentation and other useful links that you can go through:
Aggregation Optimization
Best Practices for MongoDB Performance
$regex and Index Use
Performance Tuning in MongoDB
Index Selectivity

Hope this helps.

Regards,
Satyam

OMG, I'm very excited to see your reply.
It really means a lot to me, especially as someone who is self-taught in programming.
Sometimes I really feel at a loss with certain problems.

To the first problem:
Right now there is less than 10TB of data and MongoDB works normally, but I am not sure whether find() and aggregate() will still run smoothly once the data volume becomes large.

For example:
When I search for data using the time index, could long search times on a large collection lead to poor system performance?
I’m sorry for the misunderstanding. This is a hypothetical question.

Second problem:
I am not sure whether my find() or $match queries actually use the index.
For example:
When MongoDB is initialized, I call createIndex() to add a time: 1 index.
After that, I search on time with find() and with aggregate() using $match.

However, I do not know whether this is a good way to combine find() with an index.

One additional question:
Are there any books or articles you would recommend for self-learners like me?
I have read through all the MongoDB tutorials, but there is little information about performance and scalability.

Thank you again, I am really impressed!
I will keep this in mind forever.
Best wishes

Hey @William_Lyu,

Regarding not knowing whether find() is using an index: you can use the explain output to analyze whether your queries are using an index and whether there is scope for further optimization of the index.
I have linked all the useful resources in my previous reply that should help you out with your first two questions.
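To make the explain("executionStats") advice concrete, one common check is the ratio of documents returned to documents examined: a ratio near 1.0 suggests a selective index, while a low ratio means the server scans many documents it then discards. A minimal sketch (the field names follow the server's executionStats output; the commented query is an assumption):

```python
def index_efficiency(execution_stats):
    """Ratio of documents returned to documents examined.

    Near 1.0: the index is selective and the query touches little
    beyond what it returns. Near 0: many documents are examined
    and discarded, a sign the index (or query) needs work.
    """
    examined = execution_stats["totalDocsExamined"]
    returned = execution_stats["nReturned"]
    if examined == 0:
        return 1.0  # nothing scanned, nothing wasted
    return returned / examined

# Usage (PyMongo; names are assumptions):
# stats = col.find({"time": {"$gte": start}}).explain()["executionStats"]
# print(index_efficiency(stats))
```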

Coming to books and articles, the MongoDB Documentation is the best resource to learn more about MongoDB, and it is the most up-to-date. You can also learn from MongoDB University, which hosts a lot of free, excellent courses from basic to advanced topics that you can take to increase your knowledge of MongoDB. You can also refer to the MongoDB Developer Center, which has the latest MongoDB tutorials, videos, and code examples in different languages and tools.

Hope this helps.

Regards,
Satyam

This is an example where I use aggregate() and $match:
At the beginning:
col.create_index([("time", 1)])
col.create_index([("time", -1)])

use aggregate:

for i in db.testCollection.aggregate(
    [
        {"$match": query},
        {"$sort": {"time": -1, "_id": -1}},
        {"$skip": skip},
        {"$limit": limit},
        {
            "$project": {
                "_id": {"$toString": "$_id"},
                "FileName": 1,
                "type": 1,
                "time": {
                    "$dateToString": {
                        "date": "$time",
                        "format": "%Y-%m-%d %H:%M:%S",
                        "onNull": "",
                    }
                },
            }
        },
    ]
):
    ...

find():
test = db.testCollection.find_one({"_id": ObjectId(search["ID"])}, {"_id": 0})

As I said, these are statements I learned from the MongoDB documentation. They work great while the data volume is low. However, I am concerned that they may lead to long search times when the data volume is large. I do not know whether there is a better way to use the find() and aggregate() statements.
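Two points worth noting about the pipeline above. First, a single-field index serves both sort directions, so the two separate time indexes can be replaced by one compound index on {time: -1, _id: -1} that matches the $sort exactly. Second, $skip still walks every skipped index entry, so deep pages get slower as the data grows; a common alternative is keyset ("range") pagination, resuming after the last document of the previous page. A sketch under these assumptions (field and variable names taken from the example above):

```python
def next_page_query(base_query, last_time=None, last_id=None):
    """Build a keyset-pagination filter for a (time desc, _id desc) sort.

    Instead of $skip, resume after the last (time, _id) pair of the
    previous page, so an index on {time: -1, _id: -1} can seek
    directly to the start of the next page.
    """
    query = dict(base_query)
    if last_time is not None and last_id is not None:
        query["$or"] = [
            {"time": {"$lt": last_time}},                  # strictly older timestamps
            {"time": last_time, "_id": {"$lt": last_id}},  # tie-break on _id
        ]
    return query

# Usage (PyMongo; a sketch, not the thread's exact code):
# col.create_index([("time", -1), ("_id", -1)])
# page = list(
#     db.testCollection.find(next_page_query({}, last_time, last_id))
#     .sort([("time", -1), ("_id", -1)])
#     .limit(limit)
# )
```

The trade-off is that keyset pagination only supports "next page" style navigation, but its cost stays constant regardless of how deep into the data you are.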

In the past few days, while searching the documentation, I also found the MongoDB University courses and have started taking classes. The MongoDB community is really great, both in terms of your answers and the learning resources. It is really helpful for beginners.

Thanks for your reply. It helps me a lot.