I need help choosing a shard key for my MongoDB sharded cluster.
Scenario: My application is built on .NET Core 2.1. It reads websites and updates their details in the database. I have a list of around 1 million websites that need to be crawled. The application finds new pages that are not already in my database and saves them to the database.
Cluster and Server Details: I have 3 shards (each with one primary and two secondaries) on Dell R820 machines, each with 512 GB of RAM. I run my application on 4 Dell R620 machines; it is a multithreaded application.
Database Structure: I have two main databases, one holding the list of all home pages and one for Pages.
Home pages database:
URL (shard key)
Pages database:
URL (shard key, with a unique index to avoid duplicate entries in the collection)
AlreadyRead (indexed field)
The application reads the home pages and saves the inner pages found on each home page in the Pages database. The other part of the application fetches pages from the Pages database where AlreadyRead is 0, updates it to 1, and crawls each page to save any other pages found on it in the database. This part gets slower as the data size grows, which I think is because of a wrong shard key: it is set to the URL field, so the query (which filters only on AlreadyRead) goes to all shards (I am assuming). I am saving the URL without http or www. If I set HomePageURL as the shard key instead, the data is distributed unevenly across the shards (I have already experienced this; 92% of the data ended up on one shard).
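To make the skew concrete, here is a rough Python sketch. It is not my real code or data: the site names and page counts are made up, and taking an MD5 hash modulo 3 is only a stand-in for spreading keys over three shards, not MongoDB's actual chunk placement. It compares placing documents by home page (so all pages of one site stay together) with placing them by a hash of the full URL.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 3  # matches the three shards described above


def shard_for(key: str) -> int:
    """Toy placement: hash the key and take it modulo the shard count."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


def make_pages():
    """Hypothetical crawl result: one very large site plus many small ones."""
    pages = [("bigsite.example", f"bigsite.example/page/{i}") for i in range(9000)]
    for s in range(100):
        home = f"site{s}.example"
        pages += [(home, f"{home}/page/{i}") for i in range(10)]
    return pages  # list of (home page URL, page URL) tuples


def distribution(pages, key_index):
    """Fraction of documents per shard when placing by the chosen field."""
    counts = Counter(shard_for(page[key_index]) for page in pages)
    total = len(pages)
    return {shard: counts.get(shard, 0) / total for shard in range(NUM_SHARDS)}


pages = make_pages()
by_home = distribution(pages, 0)  # place by home page URL
by_url = distribution(pages, 1)   # place by hash of the full page URL

print("by home page: ", by_home)
print("by hashed URL:", by_url)
```

In this toy model, placing by home page puts at least 90% of the documents on a single shard, because one large site dominates, while placing by a hash of the full URL keeps each shard close to a third. It says nothing about query routing: a query that filters only on AlreadyRead would still have to ask every shard in both cases.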
To cut a long story short: considering the above scenario, what would be the best shard key? Or do I have to choose a compound shard key?