Distribute a series of IDs in a Unique Random order in the most simple way possible

Tyler_Jensen · June 9, 2020, 11:08pm

I have been on the search for a solution to this problem for 5 years with MongoDB. I just watched all the new announcements and I got pretty pumped with all the awesome features, but nothing came close to solving this issue. I decided its time to put it here so maybe it can be put on your radar for future updates… Fingers crossed!

The problem: We work in a very competitive industry. We give out quotes via our software with a random 8 digit quote number. We do this so our competitors can’t figure out how many quotes we give per day by getting a new quote every 24 hours and seeing how many where issued. These IDs have to be unique, never used before.

The Challenge: Distribute a series of IDs in a Unique Random order in the most simple way possible.

There are 3 solutions to this problem in MongoDB. All three suck.

Solution 1:

Pre-insert every ID into the Quotes collection as tiny records {_id: 55555555}, {_id: 55555556}, {_id: 55555557}, etc. Have logic to pick a random number, select the first unused record. For example random number: 55555554. Run query: {"_id": {$gte: 55555554}, “name_field”: {"$exists": false}}. Once a record is found, update it with the new quote data.

This is not a good solution because 1. you have to pre-create 100000000 records. 2. each time you create a new quote you are doing an update in place where the size of the document gets much larger, so MongoDB has to move the document to the bottom of the collection on disk, which is really not good.

Solution 2:

Mine the new ID out of previously inserted quotes. You could use a do a series of DB queries to figure out a new ID by fetching a portion of data out of the collection and looking for an ID gap. You could fetch a range of 1000 records. Lets say 55555555-55556555 and count how many records are found. If you the answer is not 1000, then perform tons of queries figuring out what ID can be used.

There are different ways you can do this idea, but the gist is multiple successive full loop query calls. Super painful. At least you don’t have to insert 100000000 first… progress?

Solution 3:

Maintain 2 different collections. One that stores the remaining IDs. Like: {_id: 55555555}, {_id: 55555556}, {_id: 55555557}, etc. The other that stores the Quotes. Picking a random number like: 55555554, you can run a query like: {"_id": {$gte: 55555554}}, pick a new ID record. Use this ID to create a new record in the Quotes collection, then remove it from the IDs collection. You could use a transaction to make this a bit safer.

The problem with this is you are still pre-creating 100000000 records. If you do a big import into the Quotes collection from a csv or something, this IDs collection can fall out of date with the Quotes collection. And lastly you now have 2 collections to maintain. Every time you login to your DB you are reminded that you are a failure…

Sadly this is the solution we have been using for the last 5-6 years. It sucks, I am always on the look for a better idea / solution!

Please note, this is not just a MongoDB problem. Every DB software I know has this issue. Solution 3 is the winner for Relational DBs on the internet where this issue is discussed. I just keep hoping that MongoDB is going to pull though with a feature that can bail me out of my sadness.

Does anyone have any ideas? Thanks for your support and for taking the time to read this insanely long post. Maybe a new feature??

Justin · June 10, 2020, 12:48am

Why not use an ObjectID? It’s 12 digits instead of eight but it provides uniqueness and provides no hints about how many previous invoices were generated.

Really, anytime time-based should work. An ObjectID starts with a timestamp, a timestamp is monotonic so it will always be unique per second, and any competitor generating a new ID will only get information about what time they generated the quote. The same is true of using an 8 digit timestamp.

Tyler_Jensen · June 10, 2020, 3:02am

An objectID is 12 bytes, 24 chars in length.

My client is adamant the ID be number only and as short as possible. Representing time would not work under these circumstances unfortunately as after a short amount of time it will loop back over and we are stuck with these problems again.

Clients seem to have a knack to coming up with hard problems accidentally. lol

steevej · June 10, 2020, 10:22am

Is there any reason why quotes id have to be unique between all customer?

You could simply keep a quote count per customer and simply increase that number when a customer request a new quote. Ids would be unique for a given customer. This way your competitors could not figure out the number of quotes you give. The quote collection will have a compound index quote-id:1,customer-id:1.

Tyler_Jensen · June 10, 2020, 5:24pm

The customer service department want to ask the customer for a single number to pull up their quotes. Asking for 2 numbers increases complexity and confusion. If the customer number is incrementing then we have a similar problem as you can figure out how much new business there is per day by getting 2 quotes 24 hours apart.

steevej · June 11, 2020, 11:23am

Does this mean a customer does not have to authenticate or identify him self? That high security risk since if you get that single number your free to call customer service and get access to the quote.

I assumed that once a customer has a customer number, then it is fixed. To get a new customer number the competition will have to generate a new identity every time. Might be hard to do.

Tyler_Jensen · June 11, 2020, 4:21pm

Once the customer service rep pulls up the quote they are asked some questions about the quote to verify that this is the customers quote. This is no different than an order number at a eCommerce provider. Same policy.

steevej · June 11, 2020, 4:56pm

This seems to be a contraction.

and

But who are we to argue with the customer service department. B-) I hope you will be able to satisfy them.

chris · June 11, 2020, 5:27pm

I think you are underestimating the elegance of this solution.
A: Your requirements have created an artificially limited resource. In lieu of changing this too.
B: Your second point about this data moving on disk. Where do you think the data goes with solution 3? Also since the introduction of wired tiger I believe this pattern is not expensive as you think.
C: I would make the find of an unallocated quote number grab n numbers. So you can quickly try with another quote id if there is a violation on insert(without another find).