Best schema design for storing chatbot conversations in MongoDB?

Hi everyone,
I’m working on a project where customers can interact with a chatbot. The idea is to save and manage their conversations using MongoDB, since the data is semi-structured and varies per user.

Current schema design:

I’ve set up two collections:

  • users or sessions collection: stores basic info like session_id, user_id, etc.
  • messages collection: stores one document per message, including metadata like timestamp, sender (user or bot), and session ID.
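To make the two shapes concrete, here is a minimal sketch of the documents as plain Python dicts. Only session_id, user_id, timestamp, and sender come from the description above; the `text` field and the helper names are assumptions for illustration.

```python
from datetime import datetime, timezone

ALLOWED_SENDERS = ("user", "bot")


def make_session(session_id, user_id):
    """Build a document for the sessions collection."""
    return {"session_id": session_id, "user_id": user_id}


def make_message(session_id, sender, text, timestamp=None):
    """Build one document for the messages collection (one per message)."""
    if sender not in ALLOWED_SENDERS:
        raise ValueError(f"sender must be one of {ALLOWED_SENDERS}")
    return {
        "session_id": session_id,
        "sender": sender,
        # Default to "now" in UTC when the caller does not pass a timestamp.
        "timestamp": timestamp or datetime.now(timezone.utc),
        "text": text,  # hypothetical field name for the message body
    }
```

Each message document stays small and is never modified after insertion, which is the append-only property mentioned below.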

This design follows recommendations I found in MongoDB’s O’Reilly book, especially around keeping documents small (well under the 16MB document size limit) and treating messages as append-only data.

Queries we need to support:

  • Get full or partial user chat history (sorted by timestamp)
  • Delete a user’s chat history
  • Back up a user’s history to another collection (essentially a straight copy of the documents)
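A rough sketch of how these three operations could look with pymongo, assuming `messages` is a pymongo `Collection` and that history is keyed by session_id (a per-user version would first need that user's session IDs). The function names and the backup collection name are made up for illustration, and `$merge` assumes MongoDB 4.2 or later.

```python
def get_history(messages, session_id, limit=None):
    """Return messages in chronological order; with limit, only the newest ones."""
    query = {"session_id": session_id}
    if limit is None:
        return list(messages.find(query).sort("timestamp", 1))
    # Fetch the newest `limit` messages, then flip back to oldest-first.
    newest_first = messages.find(query).sort("timestamp", -1).limit(limit)
    return list(newest_first)[::-1]


def delete_history(messages, session_id):
    """Delete every message in the session and return how many were removed."""
    return messages.delete_many({"session_id": session_id}).deleted_count


def backup_history(messages, session_id, backup_name="messages_backup"):
    """Copy the session's messages into another collection on the server side."""
    pipeline = [
        {"$match": {"session_id": session_id}},
        # $merge matches on _id by default, so re-running the backup
        # overwrites earlier copies instead of duplicating them.
        {"$merge": {"into": backup_name,
                    "whenMatched": "replace",
                    "whenNotMatched": "insert"}},
    ]
    messages.aggregate(pipeline)
```

With `$merge` the copy happens inside the database, so the documents never travel through the application.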

Concerns and questions:

Recently, someone challenged this design by asking:
“What happens when a user has a million messages?”

At first, I thought: “Just paginate or limit the query to the latest 30 messages.”
But they showed me an example where fetching just 30 messages out of millions was still slow, apparently because the query had to scan a large index or a large number of documents. That made me reconsider.
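For context, a minimal sketch of what "limit to the latest 30" could look like, assuming pymongo, a compound index on session_id and timestamp, and range-based paging on the timestamp rather than `skip()`. The function names and the `before` parameter are illustrative, not taken from the example I was shown.

```python
# Compound index: equality on session_id first, then newest-first timestamps.
HISTORY_INDEX = [("session_id", 1), ("timestamp", -1)]


def ensure_history_index(messages):
    """Create the compound index if it does not already exist."""
    return messages.create_index(HISTORY_INDEX)


def page_filter(session_id, before=None):
    """Build the filter for one page; `before` is the oldest timestamp already seen."""
    query = {"session_id": session_id}
    if before is not None:
        query["timestamp"] = {"$lt": before}
    return query


def latest_page(messages, session_id, page_size=30, before=None):
    """Return up to page_size messages, newest first, older than `before`."""
    cursor = (
        messages.find(page_filter(session_id, before))
        .sort("timestamp", -1)
        .limit(page_size)
    )
    return list(cursor)
```

To fetch the next page, pass the timestamp of the last message from the previous page as `before`. One caveat of this sketch: messages that share an identical timestamp can be skipped at a page boundary, so a tiebreaker such as `_id` may be needed.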

Since I’m relatively new to MongoDB and schema design, I’d love to hear:

  • Is having one document per message scalable in the long term?
  • Would embedding an array of messages inside the user/session document be better, even if that risks approaching the 16MB document limit?
  • Are there better patterns for handling chat history efficiently, like capped collections, time-series collections, or bucketing strategies?
  • Any tips for optimizing retrieval of recent messages while keeping writes efficient?
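To make the bucketing option concrete, this is how I currently understand it: each document holds a bounded batch of messages for one session, and a new bucket is created once the current one is full. A minimal sketch, assuming pymongo and a separate buckets collection; the field names `count`, `messages`, `first_ts`, `last_ts` and the bucket size are my own assumptions.

```python
def bucket_append_spec(session_id, message, bucket_size=100):
    """Build the (filter, update) pair that appends a message to a non-full bucket."""
    # Only match a bucket for this session that still has room.
    flt = {"session_id": session_id, "count": {"$lt": bucket_size}}
    update = {
        "$push": {"messages": message},
        "$inc": {"count": 1},
        # Track the time range covered by the bucket for range queries.
        "$min": {"first_ts": message["timestamp"]},
        "$max": {"last_ts": message["timestamp"]},
    }
    return flt, update


def append_message(buckets, session_id, message, bucket_size=100):
    """Append to an open bucket, or let the upsert create a fresh one."""
    flt, update = bucket_append_spec(session_id, message, bucket_size)
    # upsert=True inserts a new bucket when no non-full bucket matches.
    return buckets.update_one(flt, update, upsert=True)
```

With a fixed cap per bucket, no single document can grow without bound, which is the main risk with embedding all messages in the session document.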

Any advice or examples from your own experiences would be super helpful. Thank you!