Hi guys. I use the bucket pattern for time series data, and I use this code to import my data into my collection:
for file in sorted_files:
    df = process_file(file)
    for row, item in df.iterrows():
        data_dict = item.to_dict()
        # one round trip to the database for every single row
        mycol1.update_one(
            {"nsamples": {"$lt": 288}},
            {
                "$push": {"samples": data_dict},
                "$inc": {"nsamples": 1}
            },
            upsert=True
        )
The problem is that the insert is very, very slow. Is there any way to make this faster? Is there a way to do this with a bulk insert? Thanks in advance, guys!
Well, first you need an upsert command; it has its own equivalent in bulk operations.
You keep the criteria in the filter and do the $push in the update part of the upsert. The bulk operations you accumulate in the item for loop should be executed outside the loop.
Essentially you build a bulk on the client side and run the upserts after the loop, avoiding the need to hit the database on every loop cycle…
@Pavel_Duchovny Thank you for helping me. I can't do it on my own. If it's possible, could you show me in code what I should do? I know I am asking a lot, but I am drowning on my own.
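For reference, a minimal sketch of what that suggestion looks like in PyMongo is below. It reuses the sorted_files, process_file and mycol1 names from the code above and keeps the original bucket size of 288; it is only an illustration of the idea, not a drop-in answer.

from pymongo import UpdateOne

requests = []
for file in sorted_files:
    df = process_file(file)
    for row, item in df.iterrows():
        # queue the upsert locally instead of hitting the database per row
        requests.append(UpdateOne(
            {"nsamples": {"$lt": 288}},
            {"$push": {"samples": item.to_dict()}, "$inc": {"nsamples": 1}},
            upsert=True
        ))
# a single (ordered, by default) round trip for everything queued above
if requests:
    result = mycol1.bulk_write(requests)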
Why should I do that? If I keep it there, I do a bulk_write for each file… if I move it outside the main loop, I do one bulk_write at the end for all the files? Is this why I should do the final bulk_write after the main loop? That's optimal, right?
Why should I check if it's under 1000? My collection has about 1.2M rows, imported in batches of 12, so that's about 90,000 operations if I understand right… but I think MongoDB does the splitting on its own. I mean, if it's 2000, for example, it divides the group in half.
And one last thing… what happens if I use UpdateMany instead of UpdateOne here?
from pymongo import UpdateOne

bulk_request = []
for file in sorted_files:
    df = process_file(file)
    for row, item in df.iterrows():
        data_dict = item.to_dict()
        # queue the upsert instead of sending it immediately
        bulk_request.append(UpdateOne(
            {"nsamples": {"$lt": 12}},
            {
                "$push": {"samples": data_dict},
                "$inc": {"nsamples": 1}
            },
            upsert=True
        ))
# one bulk_write after the loop for all queued operations
result = mycol1.bulk_write(bulk_request)
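On the batch-size question above: PyMongo does split a very large bulk_write into multiple batches on its own, so a manual check would mainly be about bounding how much you queue on the client side rather than a hard requirement. A sketch of slicing the bulk_request list built above into chunks could look like this; the 1000 is just an example size, not something the driver needs.

batch_size = 1000  # example chunk size, not required by the driver
for i in range(0, len(bulk_request), batch_size):
    # each slice is sent as its own (ordered, by default) bulk_write, in order
    mycol1.bulk_write(bulk_request[i:i + batch_size])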