Simple queries that run for hours

Hello,

I have data split between two Mongo databases. One is relatively small and its working set fits easily in RAM. The other is much bigger, and its working set does not fit in RAM (yes, I know ideally it should!). Various queries are run against the databases, and when only the first is used, everything is fine. When the second is queried in parallel with the first, things are obviously a little slow due to the working set not fitting in RAM, but it’s not catastrophic. Over time, though, various queries pop up that run for hours. These queries are no different from the thousands of queries that run continually and usually execute in milliseconds. If I manually type these queries into Compass, they run in milliseconds.

I’m not sure if it’s relevant, but these queries have huge values for numYields (in the millions). I guess this is just a symptom of how long the queries have been running. They also have "ReplicationStateTransition": "w" in their locks section. Is this expected? It seems strange that a query should require a write lock, especially on ReplicationStateTransition?!
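In case it helps, this is roughly how I’ve been spotting them with db.currentOp() (just a sketch; the 30-second cutoff is an arbitrary value I picked):

```js
// List active operations that have been running for more than 30 seconds,
// showing the numYields and locks fields I mentioned above.
db.currentOp({ active: true, secs_running: { $gt: 30 } }).inprog.forEach(op => {
  print(`opid=${op.opid} ns=${op.ns} secs=${op.secs_running} yields=${op.numYields}`);
  printjson(op.locks); // this is where "ReplicationStateTransition": "w" shows up
});
```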

These queries only start to appear after a little while (10–30 minutes, maybe?) and then gradually accumulate. I think they’re also the cause of the number of available read tickets gradually dropping until it hits zero, at which point I assume the load from these queries (if they’re actually generating load?) and/or the lack of read tickets means everything slows down a lot.
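For what it’s worth, this is roughly how I’ve been watching the tickets drain (a sketch; on my version the counters live under wiredTiger.concurrentTransactions in serverStatus, though I believe newer releases report them elsewhere):

```js
// Print how many WiredTiger read/write tickets are still available.
const tickets = db.serverStatus().wiredTiger.concurrentTransactions;
print(`read tickets:  ${tickets.read.available} of ${tickets.read.totalTickets} available`);
print(`write tickets: ${tickets.write.available} of ${tickets.write.totalTickets} available`);
```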

Anyway, got any idea what’s going on here?!

Thanks,

Chris.

I should have added: there are indexes for these queries, and the slow-query output shows they are being used. So it’s not a simple collection scan issue.
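For example, checking recent entries in the profiler shows an index plan rather than a COLLSCAN (a sketch; it assumes profiling is enabled, and the 100 ms threshold is just the value I happened to use):

```js
// Enable profiling of slow operations and look at the plan summary of recent ones.
db.setProfilingLevel(1, { slowms: 100 });
db.system.profile.find({ millis: { $gt: 100 } })
  .sort({ ts: -1 })
  .limit(5)
  .forEach(p => print(`${p.ns}  ${p.millis} ms  plan: ${p.planSummary}`));
```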

That is a big red flag to me. In MongoDB, many operations have what is called a “yield point”: if an operation is waiting for something, it yields control so that other operations can go ahead. Usually it yields while waiting for something like disk access. Millions of yields suggests the disk cannot serve data fast enough, so the server is left twiddling its thumbs waiting.

An indexed query on an overworked system won’t be fast either :slight_smile: though it will still generally be faster than a collection scan.

What I’m guessing is happening is that the server is way overloaded, so it just cannot reply in time. When a query takes hours to return, it’s usually a particularly bad combination of circumstances. It will be worse if the server is also shared with other resource-intensive processes, e.g. another MongoDB server or an application server.

The easiest way to mitigate this is to provision a larger instance for the overworked server. Another is to ensure that the queries are efficient (e.g. the ratio of documents returned to index keys examined is close to 1:1). Having said that, there’s really no cure if the hardware is incapable of doing the work it’s being asked to do :slight_smile:
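As a quick sanity check, something along these lines shows that ratio (a sketch; the collection name and filter are made-up placeholders):

```js
// Compare documents returned vs. index keys and documents examined for one query.
const stats = db.orders.find({ customerId: 12345 })
  .explain("executionStats").executionStats;
print(`returned: ${stats.nReturned}, keys examined: ${stats.totalKeysExamined}, ` +
      `docs examined: ${stats.totalDocsExamined}`);
// Ideally nReturned and totalKeysExamined are close to 1:1; a much larger
// totalKeysExamined or totalDocsExamined means the index isn't very selective.
```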

Best regards
Kevin

Thanks for that info, very interesting :slight_smile:

I think this is very likely the case – the server is definitely taking quite a hammering! Something I forgot to mention in my first post, though, and that I don’t understand at all, is that these long-running queries seem to continue for hours after I stop accessing the huge database (and go back to just using the one that fits in RAM and is normally no problem at all for Mongo). Surely the server should recover very quickly? Eventually this does happen with no further intervention: the long queries stop, and Mongo recovers.

Hi @Chris_Wiggins

Although I can’t really say anything about the specifics, disconnecting the client from the server won’t stop the operations that are already running on the server. These would need to be killed individually using something like db.killOp().
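Something along these lines would do it (a sketch only; the one-hour cutoff is arbitrary, so inspect each operation before killing it on a real system):

```js
// Find operations that have been active for over an hour and kill them one by one.
db.currentOp({ active: true, secs_running: { $gt: 3600 } }).inprog.forEach(op => {
  print(`killing opid ${op.opid} on ${op.ns} (running ${op.secs_running}s)`);
  db.killOp(op.opid);
});
```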

Glad you found the root of the issue though! It’s also great that once those expensive processes stopped, the server was able to recover. Definitely overworked, that server :slight_smile:

Best regards
Kevin