Pymongo find_one in a loop broke with CursorNotFound

Have a little python/pymongo program that essentially does the following:

for histarticlenumber in location["inarticle"]:
    mdbfilter = {"articleid": histarticlenumber}
    projection = {"articletitle": 1, "dateEdiISO": 1, "_id": 0}
    articledoc = artcoll.find_one(mdbfilter, projection)
    if articledoc is not None:
        edidate = datetime2yyyymmddstr(articledoc["dateEdiISO"])
        arturiref = articles_ns[
            edidate + "-" + slugify(articledoc["articletitle"])
        ]

This morning I found it had broken during the night with a CursorNotFound error.

I am confused since I am using find_one, which returns a Document (or None), not a cursor.

What’s happening? Thanks

Here’s the whole trace:

Traceback (most recent call last):
  File "/home/bob/code/mema/src/utilities/transcoders/mdb2rdf_loc.py", line 117, in <module>
    main()
  File "/home/bob/code/mema/src/utilities/transcoders/mdb2rdf_loc.py", line 98, in main
    loc_graph, locmentions_graph = build_two_rdf_graphs(
  File "/home/bob/code/mema/src/utilities/transcoders/mdb2rdf_loc.py", line 40, in build_two_rdf_graphs
    for location in loccoll.find(mdbfilter):
  File "/home/bob/.cache/pypoetry/virtualenvs/mema-NHKGf2LO-py3.10/lib/python3.10/site-packages/pymongo/cursor.py", line 1248, in next
    if len(self.__data) or self._refresh():
  File "/home/bob/.cache/pypoetry/virtualenvs/mema-NHKGf2LO-py3.10/lib/python3.10/site-packages/pymongo/cursor.py", line 1188, in _refresh 
    self.__send_message(g)
  File "/home/bob/.cache/pypoetry/virtualenvs/mema-NHKGf2LO-py3.10/lib/python3.10/site-packages/pymongo/cursor.py", line 1052, in __send_message
    response = client._run_operation(
  File "/home/bob/.cache/pypoetry/virtualenvs/mema-NHKGf2LO-py3.10/lib/python3.10/site-packages/pymongo/_csot.py", line 105, in csot_wrapper
    return func(self, *args, **kwargs)
  File "/home/bob/.cache/pypoetry/virtualenvs/mema-NHKGf2LO-py3.10/lib/python3.10/site-packages/pymongo/mongo_client.py", line 1330, in _run_operation
    return self._retryable_read(
  File "/home/bob/.cache/pypoetry/virtualenvs/mema-NHKGf2LO-py3.10/lib/python3.10/site-packages/pymongo/_csot.py", line 105, in csot_wrapper
    return func(self, *args, **kwargs)
  File "/home/bob/.cache/pypoetry/virtualenvs/mema-NHKGf2LO-py3.10/lib/python3.10/site-packages/pymongo/mongo_client.py", line 1448, in _retryable_read
    return func(session, server, sock_info, read_pref)
  File "/home/bob/.cache/pypoetry/virtualenvs/mema-NHKGf2LO-py3.10/lib/python3.10/site-packages/pymongo/mongo_client.py", line 1326, in _cmd
    return server.run_operation(
  File "/home/bob/.cache/pypoetry/virtualenvs/mema-NHKGf2LO-py3.10/lib/python3.10/site-packages/pymongo/server.py", line 134, in run_operation
    _check_command_response(first, sock_info.max_wire_version)
  File "/home/bob/.cache/pypoetry/virtualenvs/mema-NHKGf2LO-py3.10/lib/python3.10/site-packages/pymongo/helpers.py", line 179, in _check_command_response
    raise CursorNotFound(errmsg, code, response, max_wire_version)
pymongo.errors.CursorNotFound: cursor id 7243068530859290909 not found, full error: {'ok': 0.0, 'errmsg': 'cursor id 7243068530859290909 not found', 'code': 43, 'codeName': 'CursorNotFound'}

Hi @Robert_Alexander, you are correct that CursorNotFound is not possible in a simple find_one(). CursorNotFound can happen when the cursor is closed on the server side while the client is still in the middle of reading the results. One potential cause is that the server itself was restarted.

From the traceback I see that the error is coming from a find() call, not a find_one():

  File "/home/bob/code/mema/src/utilities/transcoders/mdb2rdf_loc.py", line 40, in build_two_rdf_graphs
    for location in loccoll.find(mdbfilter):

One way to make your application robust to this type of error is to catch CursorNotFound errors and retry the find():

from pymongo.errors import CursorNotFound

# Retry twice (three attempts in total).
for i in range(3):
    try:
        for location in loccoll.find(mdbfilter):
            process(location)
        break  # the whole result set was read successfully; stop retrying
    except CursorNotFound:
        if i == 2:
            raise  # give up after the last attempt
        continue

Normally, what can cause this exception?

I would actually prefer to focus more on “how to avoid/minimize the CursorNotFound exception” rather than “retry when the exception happens”. That’s because processing data takes time, especially with a large result set from a find query. Better for it to be done only once, and successfully.


Thanks Shane and Kobe.

As a matter of fact I had overlooked that the find_one() happened within a find() loop!!!

While I do understand what Kobe says, I am handling really messy data from a historical archive, and some fields can a) not exist at all, b) exist but with null content, or c) exist but with an empty string as content (unless I have yet to discover new ways of data inflicting pain on the poor programmer ;)).

So a try/except with logging of the problem, while not ideal, is good enough for me.
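
For instance, a minimal defensive sketch covering all three cases (reusing the articledoc, articles_ns, edidate, and logger names from the snippets in this thread):

title = articledoc.get("articletitle")  # case (a): the field may be missing entirely
if title:  # cases (b) and (c): rejects both None and the empty string
    arturiref = articles_ns[edidate + "-" + slugify(title)]
else:
    logger.warning("article %s has no usable title", articledoc.get("articleid"))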

Thanks a lot

FWIW I went through your suggestions but it wasn’t enough.

What seems to be working so far is a combined approach:

a) Declare a session

with client.start_session() as mdbsession:
    # Get the collection from the database
    mdatab = client[cfgdata["mongodb"]["mongodbname"]]
    loccoll = mdatab[cfgdata["mongodb"]["loccoll"]]

b) Loop with options

with loccoll.find(
    {}, no_cursor_timeout=True, session=mdbsession
).batch_size(100) as cursor:
    ...  # iterate the cursor as before

c) Manage the try/except as above

Will discover tomorrow morning if all of this was enough! :) Fingers crossed!

Thanks for the updates here. Is it possible your original code was keeping a cursor idle for longer than 30 minutes? If so you are likely running into this issue: https://www.mongodb.com/docs/manual/reference/method/cursor.maxTimeMS/#session-idle-timeout-overrides-maxtimems

As the page explains, one workaround is to create an explicit session (start_session) and periodically refresh the session to make sure the cursor is not discarded due to inactivity.
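
A minimal sketch of that pattern (the URI, database/collection names, and process() are placeholders, not from this thread):

import time

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder URI
coll = client["mydb"]["mycoll"]  # placeholder names

with client.start_session() as session:
    last_refresh = time.monotonic()
    for doc in coll.find({}, session=session):
        # Refresh the session roughly every 10 minutes so the server does
        # not expire it (and kill its cursors) after 30 idle minutes.
        if time.monotonic() - last_refresh > 600:
            client.admin.command("refreshSessions", [session.session_id])
            last_refresh = time.monotonic()
        process(doc)  # placeholder for the per-document work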

Another workaround is to lower the cursor batch size so that the cursor is iterated more frequently.
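
For example (coll and process are again placeholders):

# Smaller batches mean a getMore is sent to the server more often,
# so less wall-clock time passes between cursor round trips.
for doc in coll.find({}).batch_size(100):
    process(doc)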

If this is indeed the issue then you may want to follow DRIVERS-1602, which tracks solving this issue.


On the brink of tears :( Went to bed hopeful I had nailed it, only to find that while I was sleeping my program failed yet again.

Thanks for your hints, I believe they will help me, but I’ll try to clarify the structure of my data and program to see if our ideas are the right ones. Back to the basics :slight_smile:

My Python 3.10/pymongo prog is perusing a “locations” collection with mentions of geographical places. This “loccoll” collection contains 36,337 documents with the following structure:

{
  "_id": {
    "$oid": "641ddac2685f9e1cbbc7324b"
  },
  "placename": "Fruitvale Station",
  "geonamesid": 5351154,
  "inarticle": [
    "2003287499",
    "2003224697",
    "2003224545",
    "2003219236",
    "2003288015"
  ],
  "timestamp": {
    "$date": "2023-04-12T14:23:21.241Z"
  }
}

Please note that each of the 36K loccoll documents has an ‘inarticle’ field with an array of strings. In the example above it has only 5 entries, but some documents might have 20-30,000 entries (meaning that certain places, such as ‘Germany’, are mentioned in 25,000 news articles identified by the inarticle items).

The second collection used in this program is ‘artcoll’; it holds around 120K documents representing news articles, of which we’re interested in only three fields: “articleid” (matching those “2003288015” strings from the loccoll documents), “title”, and “dateEdiISO”, a date object for that article’s publication.

The following code in my main() sets up the connection to MDB and the ‘loccoll’ and ‘artcoll’ collections (more about the latter in a short while):

# setup MDB connection to WP articles
client = MongoClient(cfgdata["mongodb"]["mongouri"])
# Start an explicit mongodb session
with client.start_session() as mdbsession:
    # Get the collections from the database
    mdatab = client[cfgdata["mongodb"]["mongodbname"]]
    loccoll = mdatab[cfgdata["mongodb"]["loccoll"]]
    artcoll = mdatab[cfgdata["mongodb"]["artcoll"]]

and after a few lines main() calls the function I need to reconcile the locations (from the ‘loccoll’ collection) with the news articles (from the ‘artcoll’ collection) as follows:

mentionsgraph = build_mentions_graph(
    mdbsession=mdbsession, loccoll=loccoll, artcoll=artcoll, logger=logger
)

The build_mentions_graph function is the workhorse and the one that dies, perhaps after a few hours of running :( This is its declaration:

def build_mentions_graph(
    mdbsession: ClientSession,
    loccoll: Collection,
    artcoll: Collection,
    logger: logging.Logger,
) -> Graph:

Its body looks as follows:

with loccoll.find({}, no_cursor_timeout=True, session=mdbsession).batch_size(
    100
) as cursor:  # ensures the cursor is closed when finished and limits batch size
    for location in cursor:
        requests.get(HEALTHCHECKS_PING, timeout=10)
        geonames_slug = slugify(location["placename"])
        loc_uriref = locations_ns[geonames_slug]
        for errcounter in range(3):
            try:
                for histarticlenumber in location["inarticle"]:
                    mdbfilter = {"articleid": int(histarticlenumber)}
                    projection = {
                        "articletitle": 1,
                        "dateEdiISO": 1,
                        "articleid": 1,
                        "_id": 0,
                    }
                    articledoc = artcoll.find_one(mdbfilter, projection)
                    if articledoc is not None:
                        ...  # perform my graph building
                break  # all article ids for this location processed; stop retrying
            except CursorNotFound as nocursor:
                logger.error(
                    "CursorNotFound %s while processing %s for article %s",
                    nocursor,
                    location["placename"],
                    histarticlenumber,
                )
                if errcounter == 2:
                    raise
                continue
return mygraph

So my suspicion is that the program is looping as expected through the loccoll collection, but when it stumbles on a location with tens of thousands of ‘inarticle’ strings and spends a long time fetching each one from the artcoll collection (to retrieve the title and publication date), the loccoll cursor gets killed on the server and I wake up to find the program dead.

Any further suggestions?

I really appreciate your support. thanks a lot.

Update: fingers crossed. Been 5 hours and the prog is still running.

I guess what hopefully made the difference is refreshing the session every 10 minutes within the innermost loop (the one that goes through the ‘inarticle’ array of article ids).

Again thanks a bunch!

@Robert_Alexander it sounds like you should look into using $lookup to have the server join the documents. This would likely be many, many times faster than performing the lookup in Python using find_one.
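
A minimal sketch of such a pipeline (collection and field names taken from the posts in this thread; note that $lookup matches by equality, so both sides of the join must have the same BSON type):

# Unwind each location's article ids and join them against "art" server-side.
pipeline = [
    {"$unwind": "$inarticle"},
    {
        "$lookup": {
            "from": "art",
            "localField": "inarticle",
            "foreignField": "articleid",
            "as": "art_info",
        }
    },
]
for location in loccoll.aggregate(pipeline):
    ...  # matching articles, if any, are in location["art_info"]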

Another way would be to perform a find() with all the inarticle ids (although still not as good as $lookup):

articles = list(
    artcoll.find(
        {"articleid": {"$in": [int(i) for i in location["inarticle"]]}},
        projection,
    )
)

You’ll also want to ensure you have an index on “articleid” to make sure the query is fast.
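
For example:

# Idempotent: creates the index only if it does not already exist.
artcoll.create_index("articleid")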


Thanks a lot Shane. As of now the code works, albeit slowish (but it is computing 36K * 128K documents) :)

Will look into your suggestions to speed up things.

Take care and ciao from Rome, Italy

The problem was that the article id in the art collection is an int32, while in the per collection it is a string.

So the following pipeline works well in returning what I need, but the problem is that now the cursor seems to time out even though my code should prevent it:

def build_per_mentions_graph(
    mdbsession: ClientSession,
    sessiontime: datetime,
    mdbsessionid,
    percoll: Collection,
    logger: logging.Logger,
) -> Graph:
    """
    Input the PER MDB collection
    Build the mentions graph from its data
    """
    mentionsgraph = Graph()
    with percoll.aggregate(
        [
            {"$unwind": {"path": "$inarticle", "preserveNullAndEmptyArrays": True}},
            {"$addFields": {"inarticle_int": {"$toInt": "$inarticle"}}},
            {
                "$lookup": {
                    "from": "art",
                    "let": {"inarticle_int": "$inarticle_int"},
                    "pipeline": [
                        {
                            "$match": {
                                "$expr": {"$eq": ["$articleid", "$$inarticle_int"]}
                            }
                        },
                        {"$match": {"title": {"$exists": True, "$ne": ""}}},
                        {
                            "$project": {
                                "_id": 0,
                                "title": 1,
                                "dateEdiISO": 1,
                                "articleid": 1,
                            }
                        },
                    ],
                    "as": "art_info",
                }
            },
        ]
    ).batch_size(100) as cursor:
        for person in cursor:
            if (
                datetime.now() - sessiontime
            ).total_seconds() > 600:  # does the MDB session need to be refreshed?
                logger.info("Refreshing MDB session")
                mdbsession.client.admin.command("refreshSessions", [mdbsessionid])
                sessiontime = datetime.now()
            ### do my conversions on the person object here ###
    return mentionsgraph

This function receives a MongoDB session object along with its timestamp, but apparently after the first 100 persons in the iteration it pauses and then resumes; at that point person no longer points to valid objects, so the code dies with exceptions.

This topic was automatically closed 5 days after the last reply. New replies are no longer allowed.