Refine Atlas Search for similar content

Hello everyone!

So I’m using the following query to detect whether a user has posted similar content in my chat channel, so that I can stop spamming users:

pipeline = [
    {
        "$search": {
            "index": "content_comparer",
            "text": {"query": message, "path": "message"},
            "sort": {"created_at": -1},
        }
    },
    {"$limit": 5},
    {
        "$project": {
            "message": 1,
            "released": 1,
            "score": {"$meta": "searchScore"},
        }
    },
]

However, I’m finding the results to be too “aggressive”: if I write “I love coffee” and then “My coffee is great”, the two messages are detected as similar. What do you think I should do in these cases? Only show matches with a score above a certain value? If so, which value is OK to choose? I can’t find a way to understand how the score is calculated or what range of values is possible.

To provide more context: if I receive the message “hello my friend”, I want to detect content such as “heLLo My frienD”, “hello my friend1” or “hello my friens”. Using `phrase` I was able to detect the first case but not the other two. However, if I type “Hello my friend” and then type “you are my friend”, I don’t want the second message to be flagged.

My Atlas index uses lucene.standard as both the index and search analyzer, and I mapped my message field with the String data type.

Can anyone help me with this?

Hi @Pedro_Silva1 and welcome to the community forum.

To help you further, could you share a few details that would let me reproduce the issue and suggest a possible solution? Specifically:

  1. A few sample documents from the collection.
  2. The Atlas Search index definition.
  3. The expected response from the query being executed.

Regards
Aasawari

Hello @Aasawari, sure, let me provide that info.

My updated query is the following:

pipeline = [
    {
        "$search": {
            "index": "content_comparer",
            "compound": {
                "filter": [
                    {"equals": {"path": "room_id", "value": room}},
                    {"equals": {"path": "user_id", "value": user}},
                    {"range": {"path": "created_at", "gte": time_interval}},
                ],
                "must": [{"phrase": {"query": message, "path": "message"}}],
            },
            "scoreDetails": True,
        }
    },
    {"$limit": 5},
    {
        "$project": {
            "message": 1,
            "released": 1,
            "score": {"$meta": "searchScore"},
            "scoreDetails": {"$meta": "searchScoreDetails"},
        }
    },
]

text_matches = self.message_collection.aggregate(pipeline)

My index:

And some text examples:

If I first type “hello my friends” and then type the following sentences I get the following matches/scores:

“hello” → {'_id': ObjectId('6613b91bc002c76ff1cbfc2e'), 'message': 'hello my friends', 'score': 1.8647773265838623, 'scoreDetails': {'value': 1.8647773265838623, 'description': 'sum of:', 'details': [{'value': 0.0, 'description': 'match on required clause, product of:', 'details': [{'value': 0.0, 'description': '# clause', 'details': []}, {'value': 1.0, 'description': 'ConstantScore($type:token/room_id:chatroom_darwin)', 'details': []}]}, {'value': 0.0, 'description': 'match on required clause, product of:', 'details': [{'value': 0.0, 'description': '# clause', 'details': []}, {'value': 1.0, 'description': 'ConstantScore($type:token/user_id:12345)', 'details': []}]}, {'value': 0.0, 'description': 'match on required clause, product of:', 'details': [{'value': 0.0, 'description': '# clause', 'details': []}, {'value': 1.0, 'description': 'ScoreDetailsWrapped ($type:date/created_at:[1712568535187 TO 9223372036854775807]) ScoreDetailsWrapped ($type:dateMultiple/created_at:[1712568535187 TO 9223372036854775807])', 'details': []}]}, {'value': 1.8647773265838623, 'description': '$type:string/message:hello [BM25Similarity], result of:', 'details': [{'value': 1.8647773265838623, 'description': 'score(freq=1.0), computed as boost * idf * tf from:', 'details': [{'value': 4.6314873695373535, 'description': 'idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:', 'details': [{'value': 1.0, 'description': 'n, number of documents containing term', 'details': []}, {'value': 153.0, 'description': 'N, total number of documents with field', 'details': []}]}, {'value': 0.4026303291320801, 'description': 'tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:', 'details': [{'value': 1.0, 'description': 'freq, occurrences of term within document', 'details': []}, {'value': 1.2000000476837158, 'description': 'k1, term saturation parameter', 'details': []}, {'value': 0.75, 'description': 'b, length normalization parameter', 'details': []}, {'value': 3.0, 'description': 'dl, length of field', 'details': []}, {'value': 2.28104567527771, 'description': 'avgdl, average length of field', 'details': []}]}]}]}]}}

“my friends” → {'_id': ObjectId('6613b91bc002c76ff1cbfc2e'), 'message': 'hello my friends', 'score': 3.387709617614746, 'scoreDetails': {'value': 3.387709617614746, 'description': 'sum of:', 'details': [{'value': 0.0, 'description': 'match on required clause, product of:', 'details': [{'value': 0.0, 'description': '# clause', 'details': []}, {'value': 1.0, 'description': 'ConstantScore($type:token/room_id:chatroom_darwin)', 'details': []}]}, {'value': 0.0, 'description': 'match on required clause, product of:', 'details': [{'value': 0.0, 'description': '# clause', 'details': []}, {'value': 1.0, 'description': 'ConstantScore($type:token/user_id:12345)', 'details': []}]}, {'value': 0.0, 'description': 'match on required clause, product of:', 'details': [{'value': 0.0, 'description': '# clause', 'details': []}, {'value': 1.0, 'description': 'ScoreDetailsWrapped ($type:date/created_at:[1712568589841 TO 9223372036854775807]) ScoreDetailsWrapped ($type:dateMultiple/created_at:[1712568589841 TO 9223372036854775807])', 'details': []}]}, {'value': 3.387709617614746, 'description': '$type:string/message:"my friends" [BM25Similarity], result of:', 'details': [{'value': 3.387709617614746, 'description': 'score(freq=1.0), computed as boost * idf * tf from:', 'details': [{'value': 8.428622245788574, 'description': 'idf, sum of:', 'details': [{'value': 3.7906620502471924, 'description': 'idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:', 'details': [{'value': 3.0, 'description': 'n, number of documents containing term', 'details': []}, {'value': 154.0, 'description': 'N, total number of documents with field', 'details': []}]}, {'value': 4.637959957122803, 'description': 'idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:', 'details': [{'value': 1.0, 'description': 'n, number of documents containing term', 'details': []}, {'value': 154.0, 'description': 'N, total number of documents with field', 'details': []}]}]}, {'value': 0.40192925930023193, 'description': 'tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:', 'details': [{'value': 1.0, 'description': 'phraseFreq=1.0', 'details': []}, {'value': 1.2000000476837158, 'description': 'k1, term saturation parameter', 'details': []}, {'value': 0.75, 'description': 'b, length normalization parameter', 'details': []}, {'value': 3.0, 'description': 'dl, length of field', 'details': []}, {'value': 2.2727272510528564, 'description': 'avgdl, average length of field', 'details': []}]}]}]}]}}

“hello my friend” → no match given and I expected one

“hella my friend” → no match given and I expected one
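
The scoreDetails output above spells out the formula, so both scores can be reproduced by hand. The sketch below recomputes them from the values printed in that output (n, N, freq, dl, avgdl, k1, b). The function names are made up for illustration, and boost is assumed to be 1 because the query sets none. It suggests that the score has no fixed upper bound and moves whenever documents are added to the index, which is worth keeping in mind before hard-coding a threshold.

```python
import math

def idf(n, N):
    # n: documents containing the term, N: documents that have the field
    return math.log(1 + (N - n + 0.5) / (n + 0.5))

def tf(freq, dl, avgdl, k1=1.2, b=0.75):
    # dl: length of this field in tokens, avgdl: average length across the index
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))

def bm25_score(term_doc_counts, N, freq, dl, avgdl, boost=1.0):
    # For a phrase, the idf of each term is summed (see "idf, sum of:" above)
    total_idf = sum(idf(n, N) for n in term_doc_counts)
    return boost * total_idf * tf(freq, dl, avgdl)

# "hello": one term, found in 1 of 153 documents
hello_score = bm25_score([1], N=153, freq=1, dl=3, avgdl=2.28104567527771)

# "my friends": two terms, found in 3 and 1 of 154 documents
phrase_score = bm25_score([3, 1], N=154, freq=1, dl=3, avgdl=2.2727272510528564)

# Same match, but in a hypothetical index of 1000 documents
larger_index_score = bm25_score([1], N=1000, freq=1, dl=3, avgdl=2.28104567527771)

print(hello_score, phrase_score, larger_index_score)
```

The first two values agree with the reported scores up to floating-point rounding. The third is higher even though the matched text is identical, so a cut-off tuned today may behave differently once the collection grows.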

Basically, my objective is just to detect slight variations in the text so I can check for spam messages in my product chatroom. I know that might be difficult to detect. I also ran some experiments with text instead of phrase, but the results were too strict.

What do you think would help me improve this query? Oh, and if I only have a partial match in the query, like “hello” compared with “hello my friends”, I would need to ignore it, because otherwise a lot of messages would be falsely flagged as spam.
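
One way to enforce the "ignore partial matches" requirement is to treat the search results only as candidates and run a second, client-side comparison on them. The sketch below uses difflib from the Python standard library. The helper names and the 0.85 threshold are assumptions for illustration and would need tuning on real chat data.

```python
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.85  # hypothetical cut-off, not a recommended value

def normalize(text):
    # Ignore case and collapse runs of whitespace before comparing
    return " ".join(text.casefold().split())

def similarity(a, b):
    # Ratio in [0, 1]; 1.0 means the normalized strings are identical
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def is_near_duplicate(new_message, previous_message, threshold=SIMILARITY_THRESHOLD):
    return similarity(new_message, previous_message) >= threshold

def near_duplicates(new_message, candidates, threshold=SIMILARITY_THRESHOLD):
    # candidates: documents returned by the search, each with a "message" field
    return [doc for doc in candidates
            if is_near_duplicate(new_message, doc["message"], threshold)]

previous = [{"message": "hello my friends"}]
for text in ["hello my friend", "hella my friend", "hello", "you are my friend"]:
    print(text, "->", bool(near_duplicates(text, previous)))
```

Because the ratio is computed over the whole string, a short message such as “hello” scores low against “hello my friends”, while a one-character change scores high.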

Hi @Pedro_Silva1,

Thank you for your response.

I appreciate your clarification on the requirements. It seems you’re aiming for precise matches in your search criteria, where even slight variations shouldn’t trigger a response.

To ensure we’re aligned on this approach, I suggest reviewing the blog article on “Exact Matches in Atlas Search.” Additionally, if you could provide me with some sample data from the collection, I’d be better equipped to tailor the query to meet your specific needs.

Looking forward to assisting you further.

Best regards,
Aasawari

Hi @Pedro_Silva1 , you mentioned looking into text instead of phrase – did you try the fuzzy option? I think it might get you close to what you’re looking for.

You could also use the compound operator to combine the text and the phrase clauses.
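
A rough sketch of that combination, reusing the index name, field names and filters from the query earlier in the thread. The function name build_similarity_pipeline and the maxEdits value of 1 are assumptions for illustration. Each word of the new message becomes its own fuzzy text clause under must, so every word has to match within one edit, while the phrase clause under should only boosts candidates that contain the exact phrase.

```python
from datetime import datetime, timezone

def build_similarity_pipeline(message, room, user, since, limit=5):
    words = message.split()
    if not words:
        raise ValueError("message must contain at least one word")

    # One fuzzy clause per word: all of them are required (must)
    fuzzy_clauses = [
        {"text": {"query": word, "path": "message", "fuzzy": {"maxEdits": 1}}}
        for word in words
    ]

    return [
        {
            "$search": {
                "index": "content_comparer",
                "compound": {
                    "filter": [
                        {"equals": {"path": "room_id", "value": room}},
                        {"equals": {"path": "user_id", "value": user}},
                        {"range": {"path": "created_at", "gte": since}},
                    ],
                    "must": fuzzy_clauses,
                    # Optional clause: raises the score of exact-phrase matches
                    "should": [{"phrase": {"query": message, "path": "message"}}],
                },
            }
        },
        {"$limit": limit},
        {
            "$project": {
                "message": 1,
                "released": 1,
                "score": {"$meta": "searchScore"},
            }
        },
    ]

# Example values taken from the score details posted above
example = build_similarity_pipeline(
    "hella my friend",
    room="chatroom_darwin",
    user="12345",
    since=datetime(2024, 4, 8, tzinfo=timezone.utc),
)
```

The lucene.standard analyzer does not stem, which is likely why “hello my friend” did not phrase-match “hello my friends”; a one-edit fuzzy match covers that case as well as “hella”. A short message such as “hello” would still match “hello my friends” with this query, so a length or similarity check on the returned candidates is probably still needed.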