Search data in upload pdf

max_devis · September 23, 2024, 6:51pm

Hi, I’m starting to use MongoDB for the first time, and I loaded a PDF document using C#.
I would like to know how I can search for text in the PDF I uploaded.

Br
Max

Rishit_Bhatia · September 25, 2024, 1:17pm

Hi @max_devis,

Welcome to the MongoDB Community Forums! Is all of the text from the pdf in a specific field in a document in MongoDB? You can read more about Atlas Search with the .NET/C# driver here. If you’re looking for an exact match with a keyword then you can use the Equals method as shown here. Hope that helps.
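
For a plain keyword search, a minimal sketch of a `$search` stage could look like the following. It assumes the extracted text lives in a field called `text_content` and that the collection has an Atlas Search index named `default`; both names, and the `PdfTextSearch` helper itself, are assumptions for illustration rather than anything from your setup.

```csharp
using MongoDB.Bson;

public static class PdfTextSearch
{
    // Builds a $search stage that runs a full-text query against a single field.
    // "text_content" and "default" are assumed names; replace them with your own.
    public static BsonDocument BuildSearchStage(
        string query,
        string path = "text_content",
        string indexName = "default")
    {
        return new BsonDocument("$search", new BsonDocument
        {
            { "index", indexName },
            { "text", new BsonDocument { { "query", query }, { "path", path } } }
        });
    }
}
```

You would pass the stage to an aggregation, for example `collection.Aggregate<BsonDocument>(new List<BsonDocument> { PdfTextSearch.BuildSearchStage("invoice") })`.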

Thanks,

Rishit.

Michael_Lynn · September 25, 2024, 7:54pm

If you’ve loaded a PDF into MongoDB using C#, it’s likely stored as binary data (in GridFS, or perhaps as a base64-encoded string). MongoDB by itself has no native support for searching inside binary files such as PDFs, so to search for text inside the PDF you’ll need to extract the text before storing it in MongoDB. I’d strongly recommend you crack open that PDF, store the text / string data in MongoDB as metadata, and store the binary PDF in an object store (S3, for example).

You could use a lib like iTextSharp for this…

using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
using System.IO;

public string ExtractTextFromPdf(string pdfPath)
{
    using (PdfReader reader = new PdfReader(pdfPath))
    {
        StringWriter text = new StringWriter();
        for (int i = 1; i <= reader.NumberOfPages; i++)
        {
            text.WriteLine(PdfTextExtractor.GetTextFromPage(reader, i));
        }
        return text.ToString();
    }
}
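
Once you have the text, storing it is a single insert. Here is a rough sketch, assuming a collection of `BsonDocument` and the field names `filename` and `text_content` (the same names used in the rest of this post); the `PdfTextStore` class is a hypothetical helper, not part of any library.

```csharp
using System.IO;
using System.Threading.Tasks;
using MongoDB.Bson;
using MongoDB.Driver;

public static class PdfTextStore
{
    // Builds the document to store: the file name plus the extracted text.
    public static BsonDocument BuildPdfDocument(string pdfPath, string extractedText)
    {
        return new BsonDocument
        {
            { "filename", Path.GetFileName(pdfPath) },
            { "text_content", extractedText }
        };
    }

    // Inserts the document into the given collection.
    public static Task SaveAsync(
        IMongoCollection<BsonDocument> collection,
        string pdfPath,
        string extractedText)
    {
        return collection.InsertOneAsync(BuildPdfDocument(pdfPath, extractedText));
    }
}
```

Usage would be something like `await PdfTextStore.SaveAsync(collection, pdfPath, ExtractTextFromPdf(pdfPath));`.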

Then… you could even take this further and use Atlas Search and Vector Search to provide some amazing search capabilities for the content in your PDFs.

Here’s how you could generate the embeddings for Vector Search in C#.

1. Install the `RestSharp` Library:

You’ll need a library like RestSharp to make HTTP requests in C#. You can install it via NuGet:

Install-Package RestSharp

2. Get an API Key from OpenAI:

Sign up for the OpenAI API and get your API key from OpenAI.

3. Generate Embeddings in C#:

Here’s an example of how to use C# with OpenAI’s API to generate embeddings for your PDF text:

using System;
using System.Threading.Tasks;
using RestSharp;
using Newtonsoft.Json.Linq;
using MongoDB.Driver;
using MongoDB.Bson;

public class Program
{
    // OpenAI API endpoint and your API key
    private static readonly string openAiApiKey = "YOUR_OPENAI_API_KEY";
    private static readonly string openAiUrl = "https://api.openai.com/v1/embeddings";

    public static async Task Main(string[] args)
    {
        // MongoDB connection
        var client = new MongoClient("mongodb+srv://<username>:<password>@cluster.mongodb.net/");
        var database = client.GetDatabase("your_database");
        var collection = database.GetCollection<BsonDocument>("pdfs");

        // Fetch documents that already have extracted text
        var hasText = Builders<BsonDocument>.Filter.Exists("text_content");
        var documents = await collection.Find(hasText).ToListAsync();

        foreach (var doc in documents)
        {
            var text = doc["text_content"].AsString;
            var embedding = await GenerateEmbeddingFromOpenAI(text);

            // Store the embedding back in MongoDB as an array of doubles
            var filter = Builders<BsonDocument>.Filter.Eq("_id", doc["_id"]);
            var update = Builders<BsonDocument>.Update.Set("embedding", new BsonArray(embedding));
            await collection.UpdateOneAsync(filter, update);
        }
    }

    // Generate an embedding using the OpenAI API
    private static async Task<double[]> GenerateEmbeddingFromOpenAI(string text)
    {
        // Written against RestSharp 107+ (Method.Post, AddStringBody);
        // older versions use Method.POST and AddParameter instead.
        var client = new RestClient(openAiUrl);
        var request = new RestRequest("", Method.Post);
        request.AddHeader("Authorization", $"Bearer {openAiApiKey}");

        var body = new JObject
        {
            ["model"] = "text-embedding-ada-002", // A good model for embeddings - but do your research based on the content of your pdfs
            ["input"] = text
        };

        request.AddStringBody(body.ToString(), DataFormat.Json);

        var response = await client.ExecuteAsync(request);
        if (!response.IsSuccessful || response.Content == null)
        {
            throw new InvalidOperationException($"Embedding request failed: {response.StatusCode}");
        }

        var jsonResponse = JObject.Parse(response.Content);
        return jsonResponse["data"][0]["embedding"].ToObject<double[]>();
    }
}

Your data should look something like this…

{
  "_id": ObjectId("..."),
  "filename": "document.pdf",
  "pdf_data": "<binary data>", // assuming you still want to store the binary data in MongoDB
  "text_content": "Some extracted text...",
  "embedding": [0.12, -0.45, ..., 0.98] // This is your vector
}

Next, create a Vector Search Index on your embedding field.
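
As a rough guide, the index definition is a small JSON document. The sketch below builds it in C# so you can print it and paste it into the Atlas JSON editor. It assumes the field is named `embedding`, that the vectors come from `text-embedding-ada-002` (which returns 1536 dimensions), and that you want cosine similarity; `VectorIndexDefinition` is a hypothetical helper name, and you should adjust the defaults to your own model.

```csharp
using MongoDB.Bson;

public static class VectorIndexDefinition
{
    // Builds an Atlas Vector Search index definition for a single vector field.
    // The defaults are assumptions: change them to match your field and embedding model.
    public static BsonDocument Build(
        string path = "embedding",
        int numDimensions = 1536,
        string similarity = "cosine")
    {
        return new BsonDocument("fields", new BsonArray
        {
            new BsonDocument
            {
                { "type", "vector" },
                { "path", path },
                { "numDimensions", numDimensions },
                { "similarity", similarity }
            }
        });
    }
}
```

Calling `VectorIndexDefinition.Build().ToJson()` gives you the JSON to use when creating the index.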

Now you can search using Vector Search something like this:

```
using MongoDB.Bson;
using MongoDB.Driver;
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public class VectorSearchQueryExample
{
    // Set this to your Atlas connection string
    private static readonly string connectionString = "";

    public static async Task Main(string[] args)
    {
        var client = new MongoClient(connectionString);
        var database = client.GetDatabase("sample_mflix");
        var collection = database.GetCollection<BsonDocument>("embedded_movies");

        // Define the query vector (e.g., an embedding from a new document or user input).
        // It must have the same number of dimensions as the indexed vectors.
        double[] queryEmbedding = { 0.12, -0.45, /* ... */ 0.98 };  // Replace with the actual embedding

        // Create the $vectorSearch stage
        var vectorSearchStage = new BsonDocument
        {
            { "$vectorSearch", new BsonDocument
                {
                    { "index", "vector_index" },  // Assumed index name; use the name of your Vector Search index
                    { "path", "plot_embedding" }, // The field where embeddings are stored
                    { "queryVector", new BsonArray(queryEmbedding) },
                    { "numCandidates", 100 },  // Candidates considered before the top results are picked
                    { "limit", 5 }  // Number of nearest neighbors to return
                }
            }
        };
        // The similarity measure (euclidean, cosine, or dotProduct) is set on the index, not in the query.

        // Perform the vector search query
        var pipeline = new List<BsonDocument> { vectorSearchStage };
        var result = await collection.Aggregate<BsonDocument>(pipeline).ToListAsync();

        // Print the results
        foreach (var doc in result)
        {
            Console.WriteLine(doc.ToJson());
        }
    }
}
```
Hope this helps… let us know how you make out.
