Search for a word in an MS Word document (doc, docx) or PDF file

emiliano_colatosti · November 3, 2022, 3:44pm

Hi, I have some MS Word documents and PDFs stored in MongoDB Atlas.
I have a Node.js application that uploads these documents to my collection using GridFS.
Example:

{
  "_id": {
    "$oid": "6363a10923887993e298351b"
  },
  "length": 75454,
  "chunkSize": 261120,
  "uploadDate": {
    "$date": {
      "$numberLong": "1667473674312"
    }
  },
  "filename": "d9f260cb12e32f05427bd26603e079a2.docx",
  "contentType": "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
}
Second example:

{
  "_id": {
    "$oid": "6363a19a23887993e298351d"
  },
  "length": 414802,
  "chunkSize": 261120,
  "uploadDate": {
    "$date": {
      "$numberLong": "1667473819023"
    }
  },
  "filename": "8d5ffa4b3ed864f8453444430adb4c9c.pdf",
  "contentType": "application/pdf"
}

I need to search for one or more words (strings) in my MS Word documents or PDFs.
Can I use Atlas Search to do that, or do I have to use Elasticsearch alongside MongoDB?
Thank you for your response.
Emiliano

kevinadi · November 4, 2022, 3:30am

Hi @emiliano_colatosti, welcome to the community!

GridFS is a specification for storing and retrieving files that exceed the BSON-document size limit of 16 MB.

Notably, this is a convention rather than a specific mechanism in the server itself.

Also, specific to your question: MongoDB has no understanding of Office or PDF file formats, and GridFS stores each file as binary chunks, so the server cannot index the contents of those files.

Best regards
Kevin

emiliano_colatosti · November 4, 2022, 8:44am

Hi Kevin,
Thanks for the information.
I have another question: is it possible to store the MS file as text with the “String” data type instead of as “binary data”?
If I uploaded the files as string data, would I be able to do my full-text search?

Thanks for your kind reply
Emiliano

kevinadi · November 4, 2022, 8:54pm

is it possible to store the MS file as text with the “String” data type instead of as “binary data”?

It is possible, but you would need an additional pre-processing step before inserting the data into the database.

For PDFs, the Super User topic “How to extract text from pdf in script on Linux?” might be a useful starting point. For Word documents, “Extract text from a Word document using VBScript” may give you some ideas. Note that the PDF solution is for Linux and the Word solution is for Windows, since I’m not sure which OS you are using, but I’m certain there are solutions for both.
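
As a rough illustration of that pre-processing step in Node.js, the sketch below reads a file back out of GridFS and hands it to a text extractor chosen by its contentType. It assumes `bucket` is a GridFSBucket from the official mongodb driver. The names `readGridFsFile`, `extractText`, and the `extractors` map are hypothetical, and the extractor functions themselves (thin wrappers around whichever docx or PDF text-extraction library you pick) are not shown.

```javascript
// Sketch only: `bucket` is assumed to be a GridFSBucket from the official
// `mongodb` Node.js driver; `extractors` is a hypothetical map you supply.

// Read a whole GridFS file into memory as a single Buffer.
async function readGridFsFile(bucket, fileId) {
  const parts = [];
  // The download stream is a Readable, so it can be consumed with for-await.
  for await (const part of bucket.openDownloadStream(fileId)) {
    parts.push(part);
  }
  return Buffer.concat(parts);
}

// Pick a text extractor based on the contentType stored with the file.
// `file` is a document from the GridFS files collection (see the examples
// at the top of this thread).
async function extractText(bucket, file, extractors) {
  const extract = extractors[file.contentType];
  if (!extract) {
    throw new Error(`No text extractor for ${file.contentType}`);
  }
  const buffer = await readGridFsFile(bucket, file._id);
  return extract(buffer); // expected to resolve to a plain string
}
```

The `extractors` map would be keyed by the content types you actually store, for example "application/pdf" and the docx type shown in the first post, each mapped to an async function that takes a Buffer and returns the extracted text.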

Once the text is extracted, you can put it in a MongoDB document, for example:

{ _id: ...,
  title: "some title",
  text: " ... text in the document ... ",
  author: "document author",
  metadata: <some other metadata> }

Note that the schema design above is just a really oversimplified example. The actual schema that serves your needs best could look radically different. See “6 Rules of Thumb for MongoDB Schema Design” and “MongoDB Schema Design Best Practices” for more information about this.
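
To make the idea concrete, here is a minimal Node.js sketch of storing the extracted text and then searching it, either with a standard text index or with Atlas Search. It assumes `collection` is a Collection from the official mongodb driver. The function names, the `fileId` and `filename` fields, and the Atlas Search index name "default" are assumptions for illustration, not a recommended schema.

```javascript
// Sketch only: `collection` is assumed to be a Collection from the official
// `mongodb` Node.js driver. Field and function names are hypothetical.

// Combine GridFS file metadata with the extracted text.
function buildSearchDoc(file, text) {
  return {
    fileId: file._id, // points back to the original file in GridFS
    filename: file.filename,
    contentType: file.contentType,
    text,
  };
}

// One-time setup: a standard text index on the `text` field.
async function ensureTextIndex(collection) {
  return collection.createIndex({ text: "text" });
}

// Find documents containing any of the given words.
// The $text operator treats space-separated terms as a logical OR.
async function searchWords(collection, words) {
  return collection
    .find({ $text: { $search: words.join(" ") } })
    .project({ fileId: 1, filename: 1 })
    .toArray();
}

// Atlas Search variant: a pipeline to pass to collection.aggregate(...).
// Requires an Atlas Search index; the name "default" is an assumption.
function atlasSearchPipeline(words) {
  return [
    { $search: { index: "default", text: { query: words, path: "text" } } },
    { $project: { fileId: 1, filename: 1 } },
  ];
}
```

A typical flow would be to insert `buildSearchDoc(file, text)` with `collection.insertOne(...)` after each upload, then call `searchWords(collection, ["word1", "word2"])`. The matching `fileId` values can be used to fetch the original files from GridFS.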

One thing to note is that a MongoDB document has a hard size limit of 16 MB. If you have extremely large documents (even in text form), you’ll need some strategy for handling those specific cases.
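
One possible strategy, sketched below in Node.js, is to split the extracted text into several smaller documents that all point back to the same file. The 8 MB budget, the function names, and the { fileId, part, text } shape are assumptions chosen for illustration.

```javascript
// Sketch only: the byte budget and the document shape are assumptions.
const MAX_TEXT_BYTES = 8 * 1024 * 1024; // well under the 16 MB document limit

// Split text on whitespace into pieces whose UTF-8 size stays within maxBytes.
// A single word longer than maxBytes is kept whole in its own piece.
function splitText(text, maxBytes = MAX_TEXT_BYTES) {
  const pieces = [];
  let current = [];
  let currentBytes = 0;
  for (const word of text.split(/\s+/).filter(Boolean)) {
    const wordBytes = Buffer.byteLength(word, "utf8");
    // Account for the single-byte space that joins this word to the piece.
    const added = current.length ? wordBytes + 1 : wordBytes;
    if (current.length && currentBytes + added > maxBytes) {
      pieces.push(current.join(" "));
      current = [word];
      currentBytes = wordBytes;
    } else {
      current.push(word);
      currentBytes += added;
    }
  }
  if (current.length) pieces.push(current.join(" "));
  return pieces;
}

// One document per piece, linked by fileId and ordered by part.
function toPartDocs(fileId, text, maxBytes = MAX_TEXT_BYTES) {
  return splitText(text, maxBytes).map((piece, part) => ({
    fileId,
    part,
    text: piece,
  }));
}
```

Because the split happens on whitespace, single words are never cut in half, but a multi-word phrase that straddles two parts would not match a phrase search, so this trade-off should be weighed against your search requirements.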

Hopefully this is useful as a starting point!

Best regards
Kevin

emiliano_colatosti · November 7, 2022, 8:37am

Thank you so much for your reply!
