Vector search is not really semantic

Maor_Bolokan · January 25, 2024, 11:39am

I used the sample_airbnb data and created one big text field from the “reviews” collection.
I sent it to the OpenAI embedding API to get the plot_embeddings.
Then I saved it in the DB.

Then I created my search index.

Now I created a simple query, “Rooms that have air-condition”, embedded with the same OpenAI model.
And the results coming back are really bad.

For example, here is something I got back:

This place was very clean and convenient. Extremely close to the subway. It’s a little warm as it was the beginning of summer and there is no air conditioner in the apartment but with the windows open it was manageable.

It states specifically that there is no air conditioner in the apartment, so why did the search return it?

I tested it with both the cosine and euclidean similarity functions and got the same (bad) results.

I was wondering whether all it does is find similarity between the embeddings, without any understanding of what the user actually asked?

Thanks !

Benjamin_Flast · January 28, 2024, 7:32pm

Hey @Maor_Bolokan welcome to the community!

What you’re seeing here is a common challenge that developers face when utilizing vector search on their data. It is often not as simple as vectorizing an entire chunk of text and doing vector search against it. You often need to chunk text into smaller pieces that have strong semantic relevance before they are vectorized. Different models will also perform differently depending on what they were trained on. Lastly, it’s worth mentioning that Atlas Vector Search supports metadata filtering, so where you want to explicitly filter on certain values, that may be a better option.
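
To make the chunking idea above more concrete, here is a minimal sketch that splits one long review text into sentence-based chunks, so that each chunk can be embedded on its own. The helper name chunkBySentences and the maxChars limit are hypothetical choices for illustration, not part of Atlas or the OpenAI API, and real pipelines often use token-based splitters instead.

```javascript
// Hypothetical helper: group whole sentences into chunks of at most maxChars
// characters (a single overlong sentence still becomes its own chunk).
function chunkBySentences(text, maxChars = 200) {
  // Naive sentence split: break after ., ! or ? followed by whitespace.
  const sentences = text.split(/(?<=[.!?])\s+/).filter((s) => s.length > 0);
  const chunks = [];
  let current = "";
  for (const sentence of sentences) {
    const candidate = current ? current + " " + sentence : sentence;
    if (candidate.length > maxChars && current) {
      chunks.push(current); // close the current chunk and start a new one
      current = sentence;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

const review =
  "This place was very clean and convenient. Extremely close to the subway. " +
  "It's a little warm as it was the beginning of summer and there is no air conditioner in the apartment but with the windows open it was manageable.";

// Each chunk would be sent to the embedding model and stored as its own document.
const chunks = chunkBySentences(review, 80);
console.log(chunks);
```

With smaller chunks, the sentence about the air conditioner gets its own vector instead of being averaged together with cleanliness and subway remarks. Chunking alone does not make the model understand negation, which is where explicit filtering can help.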

Maor_Bolokan · January 30, 2024, 1:42pm

Thanks, but then what’s the difference between this and just doing a full-text search for the words “air condition”? Maybe I am missing something…

Prakul_Agarwal · January 31, 2024, 7:25am

Hello @Maor_Bolokan, with vector search what you are performing is an approximate k-nearest-neighbors search. When you search for “Rooms that have air-condition”, the vector search returns the k ‘nearest’ documents to your query among all the documents in your vector search index (here k is what you specified as limit in your $vectorSearch syntax).
The query always returns the top k documents closest to your query, and these results are ‘approximate’. Suppose you had only k docs in your index: the vector search query would return all k of them, however relevant or irrelevant they are. But the returned results come with a ‘similarity score’ that tells you how close each one actually is.
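
As a rough illustration of that behavior, here is a brute-force (exact, not approximate) top-k search in plain JavaScript. The three-dimensional vectors are made up purely for this sketch and are not real OpenAI embeddings, and the raw cosine value computed here is not necessarily on the same scale as the score Atlas reports.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Exact k nearest neighbors: score every document, sort, keep the top k.
function topK(queryVector, docs, k) {
  return docs
    .map((doc) => ({ text: doc.text, score: cosine(queryVector, doc.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Made-up toy "embeddings": the first two are about the same topic.
const docs = [
  { text: "has air conditioning", embedding: [0.9, 0.1, 0.0] },
  { text: "no air conditioner", embedding: [0.8, 0.3, 0.1] },
  { text: "close to the subway", embedding: [0.0, 0.2, 0.9] },
];
const query = [1.0, 0.0, 0.0];

// k equals the number of documents, so every document comes back.
const results = topK(query, docs, 3);
console.log(results);
```

Because k equals the number of documents, all three come back, including the unrelated subway text. Only the score separates them. In this toy data the "no air conditioner" text also ranks close to the query, since it is about the same topic even though it says the opposite.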

How many documents have you embedded, i.e. what is the size of your vector search index? The more documents you have in your index, the better.
You can check the semantic similarity score between your query and the returned results using $project with “score”: { “$meta”: “vectorSearchScore” } in your aggregation pipeline.
The higher the score the better, and you will want to discard results with a score lower than a threshold that you choose.

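 db.<collection>.aggregate([
   {
     "$vectorSearch": {
       <query-syntax>
     }
   },
   {
     "$project": {
       "<field-to-include>": 1,
       // exclusion is only allowed for "_id" when other fields are included
       "<field-to-exclude>": 0,
       // expose the similarity score of each returned document
       "score": { "$meta": "vectorSearchScore" }
     }
   }
 ])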
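
Filling in the template above with concrete values, a pipeline that also drops low-scoring results might look roughly like the sketch below. The index name vector_index, the text field, the default minScore of 0.8 and the helper name buildVectorSearchPipeline are all assumptions made for illustration. Only plot_embeddings comes from the original post. The optional filter is passed through as-is and only works on fields that are indexed for filtering in the vector search index.

```javascript
// Builds an aggregation pipeline; names marked "hypothetical" are assumptions.
function buildVectorSearchPipeline(queryVector, { limit = 10, minScore = 0.8, filter } = {}) {
  const vectorSearch = {
    index: "vector_index",      // hypothetical index name
    path: "plot_embeddings",    // field holding the stored embeddings
    queryVector,                // embedding of the query text
    numCandidates: limit * 10,  // neighbors considered before keeping `limit`
    limit,                      // the "k" in k nearest neighbors
  };
  if (filter) vectorSearch.filter = filter; // optional metadata pre-filter

  return [
    { $vectorSearch: vectorSearch },
    // "text" is a hypothetical field name for the embedded review text.
    { $project: { _id: 0, text: 1, score: { $meta: "vectorSearchScore" } } },
    // Discard results below the chosen similarity threshold.
    { $match: { score: { $gte: minScore } } },
  ];
}

// A real query vector would come from the same embedding model as the documents.
const pipeline = buildVectorSearchPipeline([0.1, 0.2, 0.3], { limit: 5, minScore: 0.75 });
console.log(JSON.stringify(pipeline, null, 2));
// In mongosh: db.<collection>.aggregate(pipeline)
```

The right threshold depends on the model and the data, so it is worth looking at the scores of a few known good and known bad results before picking one.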

Let me know if this helps clarify your issue.

Maor_Bolokan · January 31, 2024, 9:39am

Thanks, I will keep playing with it and let the community know.
Thanks for your support.