Hi, I have some MS documents and PDFs stored in MongoDB Atlas.
I have a Node.js application that uploads these documents to my collection using GridFS.
I need to search for a single word or multiple words (strings) in my MS documents or PDFs.
Can I use Atlas Search for this, or do I have to use Elasticsearch with MongoDB?
Thank you for your response.
Emiliano
Hi Kevin,
Thanks for the information.
I have another question: is it possible to store the MS file as text with the “String” data type rather than as “binary data”?
If I uploaded the files as string data, would I be able to run my full-text search?
Is it possible to store the MS file as text with the “String” data type rather than as “binary data”?
It is possible, but you would need an additional pre-processing step before inserting the data into the database.
For PDFs, the Super User topic “How to extract text from pdf in script on Linux?” might be a useful starting point. For Word documents, “Extract text from a Word document using VBScript” may give you some ideas. Note that the PDF solution is for Linux and the Word solution is for Windows, since I’m not sure which OS you are using, but I’m certain there are solutions for both.
Once the text is extracted, then you can put it in a MongoDB document, perhaps:
{
  _id: ...,
  title: "some title",
  text: " ... text in the document ... ",
  author: "document author",
  metadata: <some other metadata>
}
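As a rough sketch of that step in Node.js, the extracted text could be stored and searched along these lines. The function names buildTextDocument, storeAndIndex, and searchWords are hypothetical, `collection` is assumed to be a Collection object from the official `mongodb` driver, and the text is assumed to have been extracted already. This uses a regular MongoDB text index with a `$text` query, not Atlas Search.

```javascript
// Build a document shaped like the example above from already-extracted text.
function buildTextDocument(title, text, author, metadata = {}) {
  return { title, text, author, metadata };
}

// Insert the document and make sure a text index exists on the `text` field.
// `collection` is assumed to be a Collection from the official `mongodb` driver.
async function storeAndIndex(collection, doc) {
  await collection.createIndex({ text: "text" });
  const result = await collection.insertOne(doc);
  return result.insertedId;
}

// Find documents whose text matches any of the given words
// (space-separated terms in $text are combined with OR).
async function searchWords(collection, words) {
  const query = { $text: { $search: words.join(" ") } };
  return collection.find(query).project({ title: 1, author: 1 }).toArray();
}
```

With a connected client, this might be called as `await searchWords(db.collection("docs"), ["invoice", "2021"])`, where the collection name `docs` is only an example.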
One thing to note is that a MongoDB document has a hard size limit of 16 MB. If you have extremely large documents (even in text form), you’ll need a strategy for handling those specific cases.
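One possible strategy, sketched here only as an illustration, is to split the extracted text into several smaller documents that share an identifier. The names splitTextByBytes and toChunkDocuments, the `sourceId` and `chunk` fields, and the 8 MB default are all assumptions made for this example, not anything MongoDB prescribes.

```javascript
// The hard limit is 16 MB per document; stay well below it to leave room
// for the other fields. The 8 MB default is an arbitrary choice.
const DEFAULT_MAX_BYTES = 8 * 1024 * 1024;

// Split `text` into pieces whose UTF-8 size is at most `maxBytes` each.
function splitTextByBytes(text, maxBytes = DEFAULT_MAX_BYTES) {
  if (maxBytes < 4) throw new RangeError("maxBytes must be at least 4");
  const chunks = [];
  let current = "";
  let currentBytes = 0;
  // for...of iterates by code point, so surrogate pairs are never split.
  for (const ch of text) {
    const size = Buffer.byteLength(ch, "utf8");
    if (currentBytes + size > maxBytes && current !== "") {
      chunks.push(current);
      current = "";
      currentBytes = 0;
    }
    current += ch;
    currentBytes += size;
  }
  if (current !== "") chunks.push(current);
  return chunks;
}

// Turn one large text into several documents linked by a shared `sourceId`.
function toChunkDocuments(sourceId, title, text, maxBytes = DEFAULT_MAX_BYTES) {
  return splitTextByBytes(text, maxBytes).map((part, index) => ({
    sourceId,
    title,
    chunk: index,
    text: part,
  }));
}
```

Note that this simple version can cut a word in half at a chunk boundary, so a search for that word could miss it. Splitting on whitespace or paragraph boundaries would avoid that at the cost of slightly more code.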