Data structure for regular website monitor


I’m building a website monitor for around 10k websites, each of which should be checked roughly twice per month.
The monitor should start, run some tests on a batch of websites (maybe 100 per batch), wait a minute or so, and then run the tests on the next batch of 100.

In the collection: website_seeds I’ve got the list of all websites with some meta info.
The test results should be stored in website_results.

How should I model the data that tracks the progress?

If I do something like:

website_seeds.doc.status = new
start batch run
website_seeds.doc.status = processing
finished batch run
website_seeds.doc.status = completed

I’d need to “reset” all values to ‘new’ whenever I want to start the second run.

So would it be reasonable to nest this info instead?

website_seeds.doc.run_log = [{'run_start': datetime, 'status': 'completed', ...},
{'run_start': datetime, 'status': 'processing'}]

That way, I can check the most recent run_log.run_end value of every document and start a new run for the whole collection after x days.

Is there a better approach?

Based on your requirements, here’s a suggestion for structuring the data for your website monitor:

  1. Collection: website_seeds
  • Each document represents a website to be monitored.
  • Fields could include:
    • url: The URL of the website.
    • status: The current status of the website monitoring process (e.g., “new”, “processing”, “completed”).
    • Other relevant metadata about the website.
  2. Collection: website_results
  • Each document represents the test results for a website.
  • Fields could include:
    • website_id: A reference to the corresponding document in the website_seeds collection.
    • timestamp: The timestamp when the test was performed.
    • result: The result of the test (e.g., success, failure, error).
    • Other relevant data about the test results.
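
As a rough illustration of the two document shapes above, here is a minimal Python sketch using plain dicts. The `_id` value, the example URL, and the `make_result` helper are assumptions made for illustration; only `url`, `status`, `website_id`, `timestamp`, and `result` come from the field list.

```python
from datetime import datetime, timezone

# Hypothetical website_seeds document; "_id" and the URL are placeholders.
seed = {
    "_id": "site-001",
    "url": "https://example.com",
    "status": "new",
}


def make_result(website_id, result, timestamp=None):
    """Build a website_results document that references a seed by its _id."""
    return {
        "website_id": website_id,
        # Default to the current UTC time when no timestamp is supplied.
        "timestamp": timestamp or datetime.now(timezone.utc),
        "result": result,
    }


result_doc = make_result(seed["_id"], "success")
```

Keeping results in their own collection means the seed documents stay small even after many runs.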

To track the progress and history of each website’s monitoring runs, you can add a nested field within the website_seeds document:

"run_log": [
  {
    "run_start": "2023-07-18T09:00:00",
    "run_end": "2023-07-18T09:30:00",
    "status": "completed"
  },
  {
    "run_start": "2023-07-19T10:00:00",
    "status": "processing"
  }
]

In this example, each entry in the run_log array represents a monitoring run. It includes the run_start timestamp, run_end timestamp (if available), and the status of the run. This allows you to track the history of each run and determine the most recent run.
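
Finding the most recent run could look something like the sketch below. It assumes the timestamps have been parsed into `datetime` objects and that every entry has a `run_start`; the `latest_run` helper name is made up for this example.

```python
from datetime import datetime


def latest_run(run_log):
    """Return the entry with the most recent run_start, or None if the log is empty."""
    if not run_log:
        return None
    return max(run_log, key=lambda entry: entry["run_start"])


run_log = [
    {
        "run_start": datetime(2023, 7, 18, 9, 0),
        "run_end": datetime(2023, 7, 18, 9, 30),
        "status": "completed",
    },
    # A run that is still in progress has no run_end yet.
    {"run_start": datetime(2023, 7, 19, 10, 0), "status": "processing"},
]

current = latest_run(run_log)
```

Using `max` on `run_start` rather than taking the last array element avoids relying on the entries being appended in order.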

To initiate a new batch run, you can check the latest entry in the run_log array for each website. If the latest run is completed or if a certain time threshold has passed, you can update the status field of all documents in the website_seeds collection to “new” to indicate that they need to be processed again.
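
One possible way to make that decision from run_log alone is sketched below. The 14-day interval (from "roughly twice per month"), the batch size of 100, and the helper names `is_due` and `batches` are assumptions; this version also treats a website as due only when its latest run is completed and the threshold has passed, which is one reading of the rule above.

```python
from datetime import datetime, timedelta


def is_due(seed, now, interval=timedelta(days=14)):
    """Return True if the seed has never run, or its last completed run is older than interval."""
    run_log = seed.get("run_log", [])
    if not run_log:
        return True  # never processed
    last = max(run_log, key=lambda entry: entry["run_start"])
    if last["status"] != "completed":
        return False  # still processing; detecting stuck runs is out of scope here
    finished = last.get("run_end", last["run_start"])
    return now - finished >= interval


def batches(items, size=100):
    """Yield consecutive slices of at most size items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

A scheduler could filter the seeds with `is_due`, iterate over `batches(due_seeds)`, and sleep for about a minute between batches.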

By following this data model, you can easily track the progress of each website, store the test results separately, and have a history of the monitoring runs for reference.

Remember to adapt this model to your specific requirements and consider any additional fields or information that might be relevant to your monitoring process.