Starting Sync gets stuck on “…copying data” and never finishes

Hello,

Everything was working perfectly and then, after a few days, it suddenly stopped. I can’t start Sync anymore.

I have a small amount of data in Atlas:
3 collections:
COLLECTION SIZE: 20.95KB, TOTAL DOCUMENTS: 95, INDEXES TOTAL SIZE: 40KB
COLLECTION SIZE: 7.03MB, TOTAL DOCUMENTS: 18730, INDEXES TOTAL SIZE: 1.64MB
COLLECTION SIZE: 6.68MB, TOTAL DOCUMENTS: 25452, INDEXES TOTAL SIZE: 1.68MB

In Realm there is nothing special: 2 applications (production and development), simple schemas, basic foreign key relationships, nothing more.

Everything was working perfectly for a few days: starting Sync was instant, and on a new device the first data initialisation took just seconds. Then it suddenly stopped 2 days ago. First I noticed that on a new device the first sync takes a very long time and sometimes never finishes. I tried to terminate Sync and configure a new one. It didn’t help, but I noticed that starting Sync never finishes; it stays on “…copying data”, and I can also see this error:

Enabling Sync…approximately 25450/25450 documents copied
failed to insert change into history during initial sync of ns=‘XXXXXXXXX.Person’ after copying 0 documents: connection(XXXXXXXXX.mongodb.net:27017[-670584]) failed to write: context deadline exceeded

and after some hours I got an alert saying “Sync has stopped between Atlas and Realm. Fix this issue by attempting to restart sync for Atlas.”

I also noticed that the Logical Size of my Atlas cluster suddenly grew from about 50 MB to over 500 MB (although the collections are still as small as before, less than 20 MB). It seems that the actual problem appeared before this sudden growth, and that Atlas created additional internal data while repeatedly trying to start Sync.

I tried to terminate Sync and create it again: same problem. Creating a new app didn’t help either.

Please help! :slight_smile:

Radek

I noticed one more thing: in MongoDB Compass I can see that the __realm_sync DB is very big, especially the client_history collection:
Documents: 90
Avg. Document Size: 5.3 MB
Total Document Size: 479.9 MB

So this is why the cluster is so big now. Is it safe and possible to clear it? And why is it so big?
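As a quick sanity check on the Compass figures above, the arithmetic can be worked through in a few lines of Python. This is only a sketch using the numbers quoted in this thread; the variable names are illustrative, and 1 MB = 1024 KB is an assumption about how the UI rounds sizes.

```python
# Figures quoted from Compass for the client_history collection.
history_doc_count = 90
history_avg_doc_mb = 5.3
history_reported_total_mb = 479.9

# Expected total if the average is accurate: count * average size.
history_estimated_total_mb = round(history_doc_count * history_avg_doc_mb, 1)

# User data reported for the three Atlas collections (assuming 1 MB = 1024 KB).
user_data_mb = 20.95 / 1024 + 7.03 + 6.68

# How many times larger the sync history is than the data it describes.
history_to_data_ratio = history_reported_total_mb / user_data_mb

print(f"estimated history size: {history_estimated_total_mb} MB")
print(f"user data size: {user_data_mb:.2f} MB")
print(f"history is about {history_to_data_ratio:.0f}x the user data")
```

The estimate agrees with the reported total to within about 1%, so the growth in Logical Size is accounted for almost entirely by this one collection.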

One more update: I decided to delete all documents from __client_history to free storage space, but the original problem still exists. When starting Sync it gets stuck on “…copying data”, and an error similar to the one before appears. Also, 2 big 5.4 MB documents have already appeared in client_history, so I think it will keep growing back to 500 MB within a few hours. Already 5 big documents now :slight_smile:

Any idea how to solve it?

Dropping __realm_sync doesn’t help, but I noticed some more errors in the log:

integrating changesets failed: error inserting history entry batch for transaction batch ID 60743fa12713fa48e9c771de: (NoSuchTransaction) Transaction 1 has been aborted. (ProtocolErrorCode=101)

Write Summary:
{
  "Person": {
    "replaced": [
      "7c8a0247-2b56-48b3-97fb-d3cf4b134c52",
      "5735639d-ee64-4355-b001-52e15d4a49d1",
      "da43ebc6-5771-4bb0-9c63-3ada0245c4a1",

@Radoslaw_Kubas You should never modify or touch the contents of the __realm_sync database unless instructed by support staff; those contents are needed for the sync service to function properly. Because the contents were modified, the sync history is now corrupted and the sync service will hit all sorts of weird errors.

Can you now please terminate sync, wait 5 minutes, and then re-enable sync? If it still does not recover please open a support ticket.

To answer your question on why this database grows in size: it keeps a history of sync based on the operations that occur, so that it can perform conflict resolution deterministically. You can read about changesets here:
https://docs.mongodb.com/realm/sync/protocol/#changeset

Hello @Ian_Ward

Thank you for your support. I followed your advice and restarted Sync. This is what the logs look like:

RequestID: [607559e62713fa48e9dfac90]

  1. OK 0ms SyncConnection Start
  2. OK 505ms SyncWrite
    Comment collection - 92 documents, 20.3 KB, avg. doc. size: 227.9 B
    [
    “Upload message contained 1 changeset(s)”,
    “Integrating upload required conflict resolution to be performed on 0 of the changesets”,
    “Latest server version is now 2”
    ]
  3. OK 99455ms SyncWrite
    Record collection - 25468 documents, 6.7 MB, avg. doc. size: 275.2 B
    [
    “Upload message contained 1 changeset(s)”,
    “Integrating upload required conflict resolution to be performed on 0 of the changesets”,
    “Latest server version is now 3”
    ]
  4. OtherClientError 139959ms SyncWrite
    Person collection - 18743 documents, 7.0 MB, avg. doc. size: 393.6 B
    integrating changesets failed: error inserting history entry batch for transaction batch ID 60755b322713fa48e9dfb3a8: (NoSuchTransaction) Transaction 8 has been aborted. (ProtocolErrorCode=101)
  5. OtherClientError 139959ms SyncSession End
    Person collection - 18743 documents, 7.0 MB, avg. doc. size: 393.6 B
    Ending session with error: integrating changesets failed: error inserting history entry batch for transaction batch ID 60755b322713fa48e9dfb3a8: (NoSuchTransaction) Transaction 8 has been aborted. (ProtocolErrorCode=101)
    [
    “Session was active for: 7m48s”
    ]
    Session Metrics:
    {
    “uploads”: 3,
    “downloads”: 1
    }
  6. OK 0ms SyncConnection End

It keeps trying to repeat the Person sync, so log entries 1, 4, 5 and 6 repeat, and the Person collection is still not initialised for Sync.

It is an M0 cluster, but it is just a test environment and the data is small. Last week everything was working perfectly and such operations took seconds or less, but since Friday/Saturday, when the sync problems appeared, it has been much slower than before. Now, as you can see, SyncWrite takes 505 ms for a 20.3 KB collection with 92 documents, and 99455 ms for a 6.7 MB collection with 25468 documents.
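To put the two successful SyncWrite entries from the log above in perspective, here is a small Python sketch that converts them into throughput. The sizes and durations are the ones in the log; the `throughput` helper is a hypothetical name, and 1 MB = 1024 KB is an assumption.

```python
def throughput(size_kb, documents, elapsed_ms):
    """Return (KB per second, documents per second) for one SyncWrite entry."""
    elapsed_s = elapsed_ms / 1000
    return size_kb / elapsed_s, documents / elapsed_s

# Comment collection: 20.3 KB, 92 documents, 505 ms.
comment_kb_s, comment_docs_s = throughput(20.3, 92, 505)

# Record collection: 6.7 MB, 25468 documents, 99455 ms.
record_kb_s, record_docs_s = throughput(6.7 * 1024, 25468, 99455)

print(f"Comment: {comment_kb_s:.1f} KB/s, {comment_docs_s:.0f} docs/s")
print(f"Record:  {record_kb_s:.1f} KB/s, {record_docs_s:.0f} docs/s")
```

Both entries come out at tens of kilobytes per second and a few hundred documents per second, which fits a slow cluster better than a problem specific to one collection.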

I also noticed that since then even the first sync on devices has become very slow. Before, it took a few seconds, but recently (even when Sync is configured correctly) it never ends, or on some random devices it succeeds but takes longer than a few minutes.

Thank you in advance for any help!

@Ian_Ward one update: after over 1 hour and over 10 repeats of entries 1, 4, 5 and 6, at Apr 13 14:41:19+02:00 I got:

Request ID: [6075916dd65bcffb7b896559]

  1. OK 0ms SyncConnection Start
  2. OK 91567ms SyncWrite
    Person collection - 18743 documents, 7.0 MB, avg. doc. size: 393.6 B
    [
    “Upload message contained 1 changeset(s)”,
    “Integrating upload required conflict resolution to be performed on 0 of the changesets”,
    “Latest server version is now 30”
    ]

Finally, it seems that syncing the Person collection took less than 100 000 ms and succeeded. :slight_smile: After this, about 1 minute later, I saw one more repeat of log entries 4 and 5 (maybe from a sync task started before this successful sync?), but that was the last time. As I can see now on the test device, it was possible to download the correct data on initial sync, but it still takes very long: minutes, not seconds as before.

All this happened without any intervention on my side. It seems that something happened to this M0 instance: it started to work so slowly that even internal tasks like setting up Sync are not possible because of timeouts.

Can you share the URL of the Realm App experiencing this please so we can take a look on our end?

@Ian_Ward sure, I sent it to you in private message, just now.

@Radoslaw_Kubas So the sync does appear to be in a working state. The transaction-aborted error occurs when one UPLOAD attempts to write to a document that is currently being modified by an in-progress write transaction. The write is aborted and then re-attempted at a later time. These errors are more common during an initial sync (terminate/re-enable) of a Realm Sync app because the sync service is translating all of the Atlas state into readable Realm Sync history operations. We do have plans to block uploads until initial sync has finished. Your initial sync is also taking a long time because you are on an M0 shared tier.

For now, please wait until your initial sync is complete before attempting to make any mutations on the Atlas cluster or connect with sync clients and upload changes.
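The abort-and-retry behaviour described above can be illustrated with a small Python simulation. This is only a sketch of the general pattern, not the sync service's actual implementation: `TransactionAborted`, `write_with_retry`, `flaky_upload`, and the backoff values are all hypothetical names and numbers chosen for illustration.

```python
import time

class TransactionAborted(Exception):
    """Hypothetical stand-in for an abort such as NoSuchTransaction."""

def write_with_retry(write, max_attempts=5, base_delay_s=0.01, sleep=time.sleep):
    """Call write(); if it is aborted, wait and try again with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return write()
        except TransactionAborted:
            if attempt == max_attempts:
                raise  # give up and surface the error to the caller
            sleep(base_delay_s * 2 ** (attempt - 1))

# Simulated upload that collides with an in-progress write twice, then succeeds.
attempts = {"count": 0}

def flaky_upload():
    attempts["count"] += 1
    if attempts["count"] <= 2:
        raise TransactionAborted("document is being modified by another transaction")
    return "applied"

result = write_with_retry(flaky_upload, sleep=lambda seconds: None)
print(result, "after", attempts["count"], "attempts")
```

In this model each aborted attempt shows up as an error entry even though the write eventually succeeds, which is consistent with the repeated error entries followed by an OK in the logs earlier in the thread.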

Thank you for your support and advice. Is it normal that M0 performance has been so low during the last few days? As I mentioned before, with the same data a few days ago the initial client sync took seconds, maybe 10-20, no more. Now it takes at least a few minutes, and I still have some “unlucky” devices on which it never finishes (some Android, some iOS). It is not possible to use it even in development. Is M2 much better or exactly the same?

If performance testing is your goal then I’d recommend retesting on larger Atlas tiers. It does seem like the server is bottlenecked trying to create DOWNLOAD messages for your sync clients.

:slight_smile: My main goal is just development now, but it seems that something is still not OK. I paused the “production” sync and left only “development”. Just in case, I removed the app from all devices and did a fresh install on only one. Still nothing… in the logs it looks like it is starting a session again and again:

And on the device the realm is still not initialised.

realm = await Realm.GetInstanceAsync(configuration);

never finishes, and I also never get a callback from

OnProgress = progress =>
or
Session.Error += (sender, err) =>

It must be something more than just low M0 performance: it is just one device, and less than 20 MB of data to download…

What do you think?

I just sent you detailed logs from the client device in a private message. This part is problematic:

00:37:21.482592+0200|XXXXXXXXXX.iOS|[AWK] [REALM] [LOG] level: Debug log: Connection[1]: Timeout on reception of PONG message|
00:37:21.482768+0200|XXXXXXXXXX.iOS|[AWK] [REALM] [LOG] level: Info log: Connection[1]: Connection closed due to error|
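To see how often this pattern occurs in a longer client log, the lines can be tallied with a short Python script. This is a minimal sketch that assumes every line follows the same `level: … log: Connection[n]: message|` layout as the two lines above; `LOG_LINE` and `summarize` are illustrative names, not part of any Realm tooling.

```python
import re
from collections import Counter

# Matches the "level: <Level> log: Connection[<n>]: <message>" part of a line.
LOG_LINE = re.compile(
    r"level: (?P<level>\w+) log: Connection\[(?P<conn>\d+)\]: (?P<msg>[^|]*)"
)

sample_lines = [
    "00:37:21.482592+0200|XXXXXXXXXX.iOS|[AWK] [REALM] [LOG] level: Debug log: "
    "Connection[1]: Timeout on reception of PONG message|",
    "00:37:21.482768+0200|XXXXXXXXXX.iOS|[AWK] [REALM] [LOG] level: Info log: "
    "Connection[1]: Connection closed due to error|",
]

def summarize(lines):
    """Count how many times each connection message appears."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match:
            counts[match.group("msg").strip()] += 1
    return counts

counts = summarize(sample_lines)
for message, count in counts.most_common():
    print(count, message)
```

Run against a full log export, a high count for the PONG timeout message relative to the session length would suggest that the client keeps dropping the connection before the server answers.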

Good news :slight_smile: Everything is working again :slight_smile: The bad part is that it started working by itself, so it seems to be a problem on the Mongo / Realm side. Last night I left the app running in the iOS simulator to watch the logs, and in the morning, at about 6 AM, it suddenly started downloading the initial data. I’ve checked just now and the speed is as good as before; the first initialisation again takes a few seconds. So something was fixed at that time. I will send you those logs in a private message.

Once again, thank you for all your support!

@Ian_Ward Hello!

Bad news: the old problem has returned. Everything was perfect for about a week and then, without any change in the DB, it started again (today, I think). I can see the same problem on initial sync in the logs as before:

level: Debug log: Connection[1]: Timeout on reception of PONG message|
level: Info log: Connection[1]: Connection closed due to error|

Last time the problem disappeared by itself, but it took a few days… I hope it was possible to find a solution on your side last time.

Best regards and I hope you can help! Thank you in advance!

Can you please open a ticket with support so I can have engineering take a look?

This is an M0 cluster, so I think it is impossible for me to open a ticket.

There is a chat widget in the Realm UI that will allow you to talk with a Cloud Support Engineer, who can then open a support ticket. Feel free to link to this forum post.

Thank you very much!